EXECUTIVE SUMMARY


The electronic subsystems of future overhead collection 
platforms will require extremely high performance digital 
logic elements performing such tasks as data 
compression/decompression, data encryption, spread spectrum 
modulation, etc. To accomplish this, bit rates must reach 
into the gigabits per second range. Such speed obviously 
requires digital logic which will function correctly at 
clock rates of tens of gigahertz. The need for such high 
performance has led to the implementation of logic systems 
using indium phosphide (InP) heterojunction bipolar 
transistors (HBT) technology. However, clock frequency and 
pipeline throughput in digital systems implemented with InP 
HBT technology are ultimately limited by clock, control 
signal, and data skew which is a much larger percentage of 
the clock period than it is in lower-speed digital systems 
implemented with complementary metal oxide semiconductor 
(CMOS) technology. Therefore, the presence of clock skew in 
high-speed digital systems defines a limitation for the 


advantages of synchronous pipelined architectures. 


It is the purpose of this thesis to design a 
synchronous 8x8 bit pipelined multiplier as a high-speed 
logic test circuit using InP HBT technology and, 
furthermore, to quantify the impact of clock skew on 
its performance. This work represents the initial phase of a 
larger research project to determine if asynchronous 
pipeline control will yield greater overall pipeline 
throughput in high-performance InP HBT digital integrated 
circuits and if the resulting elimination of the clock 
distribution tree will reduce power consumption, device 
count and layout area. All simulation data is based upon 


the results obtained from Tanner SPICE design tools. 


Having received InP HBT device specifications from 
Hughes Research Laboratories, this project commenced with 
the design of an HBT logic family utilizing current-mode 
logic. Each circuit was designed and optimized for a 
minimum power-delay product while driving a maximal fanout 
load of four logic gates. This design effort produced the 
four essential circuit functions necessary for the practical 
implementation of any synchronous logic circuit: an 
inverter/buffer gate, an OR/NOR gate, a D-type latch, and a 


practical current source. 


Using the building blocks of this logic family, an 
array multiplier was constructed and further configured into 
five distinct pipeline implementations. These included a 
one, two, four, six, and ten-stage pipeline, respectively. 
A comparative analysis of their performance effectively 
illustrated the trade-offs of pipelining, i.e., the cost of 
the additional registers was shown to outpace the increase 
in throughput beyond a six-stage implementation. At a 
throughput of 4.55 gigahertz, the six-stage 
pipelined multiplier was the most efficient design (in the 
absence of clock skew). The highest throughput achieved was 
5.56 gigahertz by the costly ten-stage implementation. 


Power consumption ranged from 4.4 to 14 watts. 


In the final analysis, clock skew was not simulated 
because SPICE simulations effectively eliminate skew from 
their calculations. Rather, the impact of clock skew was 
determined by applying numerical analysis to the no-skew 
Simulation results. A range of possible skew values was 
considered in order to demonstrate a performance trend. The 
results confirmed that digital system throughput rates which 
are obtained as a function of higher clock rates will 
experience the most drastic performance reductions in the 


presence of clock skew. Also, it was shown for a typical 
value for skew in this circuit that the efficiency curve 
shifts to indicate that the four-stage pipeline is the most 


efficient implementation, vice the six-stage pipeline. 
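
The trend described above can be sketched numerically. The short Python example below assumes, as a simplification, that skew adds directly to the no-skew clock period; the 5.56 gigahertz figure is the ten-stage throughput quoted above, while the 1 gigahertz comparison point and the skew values are hypothetical and chosen only to show the trend.

```python
def throughput_with_skew(f_noskew_ghz, t_skew_ps):
    """Throughput in GHz after skew is added to the no-skew clock period."""
    period_ps = 1000.0 / f_noskew_ghz          # no-skew period in picoseconds
    return 1000.0 / (period_ps + t_skew_ps)


def relative_loss(f_noskew_ghz, t_skew_ps):
    """Fraction of the no-skew throughput that is lost to skew."""
    return 1.0 - throughput_with_skew(f_noskew_ghz, t_skew_ps) / f_noskew_ghz


# 5.56 GHz is the ten-stage result from the text; 1.0 GHz and the skew
# values (in picoseconds) are hypothetical comparison points.
for f_ghz in (1.0, 5.56):
    for skew_ps in (5.0, 10.0, 20.0):
        loss = relative_loss(f_ghz, skew_ps)
        print(f"{f_ghz:5.2f} GHz, {skew_ps:4.1f} ps skew: {loss:.1%} throughput lost")
```

For any fixed skew, the faster design loses the larger share of its throughput, which is the behavior the summary describes.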


The design products and test results from this thesis 
provide a reference point for further research into 
alternative clocking/control techniques. Specifically, it 
is intended that future research use the CML HBT logic 
family designed in this thesis in order to implement the 
same array multiplier circuit using asynchronous control 
techniques. One such endeavor is already in progress as 
LtCol. Kirk Shawhan, USMC, investigates the use of local 
completion signals which employ request/acknowledge 
handshake signals to control the flow of data vice the use 


of a global clock signal. 
I. INTRODUCTION 


A. THE RELEVANCE OF HIGH-SPEED LOGIC 

The demand for increased processing speeds in digital 
electronics has driven the clock period of logic circuits 
from a scale of microseconds to one of picoseconds over the 
past twenty years. This remarkable trend is the synergistic 
result of technological advancements and innovations in 
device physics, very-large-scale integrated (VLSI) circuit 
fabrication, and digital systems architecture. Moore’s Law 
accurately predicted this trend of improvement 35 years ago, 


and current expectations are that the trend will continue 


(Moore, 1997). Consider the anticipation of such 
technologies as real-time multimedia satellite 
communications and broadband networks. These applications 


will require extremely high performance digital logic that 


can function reliably at clock rates of tens of gigahertz. 


B. THE PROBLEM OF CLOCK SKEW 

There are a variety of technological hurdles to clear 
before achieving such clock speeds, and it is the purpose of 
this thesis to explore one particular hurdle in the course 
of digital systems architecture: the problem of clock skew 
in high-speed logic. Clock skew is the difference between 
arrival times of the clock signal at different synchronous 


clocked devices (Harris, 1999). As clock frequencies reach 


into the multi-gigahertz range, clock skew is an increasing 


concern for high-speed circuit designers because it accounts 
for an increasing portion of the clock period — leaving 


less of the clock period to be budgeted for logic and 
latching delays. What was once a near negligible quantity 
has now become a significant design constraint. (Wakerly, 


2000) 


C. THE DESIGN OF A TEST CIRCUIT 

This thesis presents the design of a high-speed logic 
test circuit and the simulation of its performance in order 
to identify and quantify the effects of clock skew. It 
should be noted that these results are intended to serve as 
a reference for future research involving potential 
solutions for the reduction of clock skew. The following 
paragraphs develop the necessary specifications of the test 
circuit. 

To ensure valid results, it is important that the 
problem be simulated in an accurate context. Therefore, it 
is necessary to select a logic family based upon a 
transistor model that is capable of realizing multi- 
gigahertz clock speeds. Although complementary metal-oxide- 
semiconductor (CMOS) technologies dominate VLSI 
applications, for comparable fabrication technologies, a 
bipolar circuit is approximately 2.5 times faster than a 
functionally similar CMOS circuit (Foley, 1994). Typically, 
such high-speed bipolar circuits employ emitter coupled 
logic (ECL) or current mode logic (CML). Notably, these 
logic families consume significantly more power than field 
effect transistor (FET) logic families; however, the trade- 
off is accepted here for the purpose of achieving sufficient 
clock speeds. For these reasons, current mode logic is 
employed to design a family of logic gates based upon the 
transistor specifications for an indium phosphide (InP) 
heterojunction bipolar transistor (HBT), courtesy of Hughes 
Research Laboratories. 

Additionally, it is important that the architecture and 
functionality of the test circuit provide a relevant context 
for evaluation. It should be noted here that the shorter 
clock periods discussed above are not exclusively the result 
of faster gate delays (i.e., faster transistors) but are also 
the result of pipelined architectures which require fewer 
gate delays per clock cycle. In keeping with this 
characteristic of high-speed logic circuits, the test 
circuit implements a pipelined architecture. As for circuit 
functionality, an 8x8 bit multiplier was chosen to provide 


sufficient complexity for pipeline implementation. 


D. THESIS OUTLINE 
The purpose of this thesis is to design, simulate, and 
evaluate the performance of a high-speed (InP HBT) 8x8-bit 


pipelined multiplier in the presence of clock skew. The 


discussion begins with the review and development of several 
fundamental topics in Chapter II: clock skew, pipelining 
principles, logic-level design of a multiplier, and 
transistor-level design of BJT/HBT logic. Based upon that 
foundation, Chapters III through V present the hierarchical 
design of the pipelined multiplier from the bottom up. 
Respectively, these chapters address logic circuit design, 
clock-driven circuit design, and pipeline design. Each of 
the design chapters presents a complete discussion of 
pertinent design issues, low-level simulation, performance 
optimization, and final design specifications. Finally, 
Chapter VI records the analysis of clock skew and 


Chapter VII summarizes the conclusions of the entire work. 


II. BACKGROUND 


A. CLOCK SKEW 

Clock skew is the difference between the arrival times 
of the clock signal at two different clock-driven devices, 
as illustrated in Figure (2-1). This difference is 
dependent upon multiple issues including normal component 
variations, wire propagation delay, RC delays, propagation 
distance, environmental variations (such as operating 
temperature), and clock loading. Notably, all of these 
contributing factors have been increasing relative to gate 
delays. (Harris, 1999) 


Figure 2-1. Clock Skew (After Wakerly). [Signals shown: IN, CLOCK.] 


In traditional logic designs which employ flip-flops 
and operate at extremely high clock frequencies, clock skew 


has become a significant portion of the total clock period. 


For a fixed-length clock period, this effectively reduces 
the amount of time available for computation. Equation 
(2-1) quantifies the terms which contribute to the minimum 


clock period ($T_{min}$) of a traditional synchronous logic circuit. 

(2-1)   $T_{min} = t_{skew} + t_{logic} + t_{\text{Flip-Flop}}$ 

where $t_{\text{Flip-Flop}} = t_{setup} + t_{prop}$. 
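
As a rough numerical illustration of Equation (2-1), the following Python sketch sums the skew, logic, and flip-flop delay terms and reports what share of the minimum clock period is consumed by skew. The function names and all delay values (in picoseconds) are hypothetical and are not taken from the thesis simulations.

```python
def min_clock_period(t_skew, t_logic, t_flipflop):
    """Minimum clock period as the sum of skew, logic, and flip-flop delays."""
    return t_skew + t_logic + t_flipflop


def skew_share(t_skew, t_logic, t_flipflop):
    """Fraction of the minimum clock period consumed by clock skew."""
    return t_skew / min_clock_period(t_skew, t_logic, t_flipflop)


# Hypothetical delays in picoseconds: the same 10 ps of skew weighs far
# more heavily on a short (fast) period than on a long (slow) one.
slow = skew_share(t_skew=10.0, t_logic=900.0, t_flipflop=90.0)
fast = skew_share(t_skew=10.0, t_logic=60.0, t_flipflop=30.0)
print(f"slow circuit: {slow:.1%} of the period is skew")
print(f"fast circuit: {fast:.1%} of the period is skew")
```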


The simplest and most direct technique for minimizing 
clock skew would seem to be the implementation of a uniform 


clock distribution hierarchy which provides a local clock 


signal to a smaller portion of the entire circuit, i.e., a 
subcircuit. For signals that remain within the subcircuit, 
clock skew is reduced. The maximum propagation delay from 


the local clock source to the farthest clock input of the 
subcircuit can be kept within a desirable tolerance. But 
inevitably, signals must travel between subcircuits. This 
is an increasingly common occurrence when the maximum size 
of the subcircuit is restricted by practical limitations for 
fanout and power consumption — especially true in the case 
of current-driven logic. 

The local clock signals are not without skew relative 
to each other. Although the delay paths for each branch of 
the clock distribution tree may contain the same number of 


gate delays, the switching behavior along each path varies 


within a narrow range. Thus, when a signal from one 
subcircuit must drive logic in another subcircuit, the 
worst-case value of clock skew must be assumed. 

An extensive clock distribution tree is employed in 
this thesis to provide local clock signals for circuit 
elements of a pipelined multiplier. Ultimately, the purpose 
is to quantify the clock skew experienced in a high-speed 
logic circuit and explore the impact of clock skew as the 


clock period is reduced. 


B. PRINCIPLES OF PIPELINING 

As referenced in the previous section, the minimum 
clock period is governed by the relationship presented in 
Equation (2-1). For a given block of combinational logic 


with an associated propagation time of $t_{logic}$, the minimum 
clock period is required to be even greater. In the face of 
a large, complex combinational circuit (Figure 2-2a) this 
could impose undesirable restrictions on clock speed. 
However, a pipelined approach suggests that the 
combinational logic can be broken down into discrete levels 
of operation, known as pipeline levels (Figure 2-2b). Each 
pipeline level will contain fewer levels of logic than the 
original combinational circuit, and ideally, each pipeline 
level will contain the same number of logic levels in order 


to achieve near-equal propagation delays. Then, by adding 


appropriately sized registers between these levels (Figure 
2-2c), the function of the original combinational logic can 
be achieved by sequentially sending operands through the 
series of pipeline levels. 


Furthermore, this can be done at a higher clock rate 
since the period is now governed by Equation (2-2), where 
$t_{logic}$ has now become $t_{\text{pipe-level}}$. 

(2-2)   $T_{min} = t_{skew} + t_{\text{pipe-level}} + t_{\text{Flip-Flop}}$ 


The improvement in clock speed is quantified as the 
percentage of speedup, Equation (2-3). (Pollard, 1990) 

(2-3)   $\text{Speedup} = \dfrac{\text{Time for } M \text{ operations WITHOUT pipelining}}{\text{Time for } M \text{ operations WITH pipelining}}$ 
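
The speedup ratio can be explored with a short Python sketch. It rests on two assumptions that are not stated in the text: the logic delay divides evenly across the pipeline levels, and a pipeline of m stages needs m + M - 1 clock cycles to finish M operations (a common textbook model). All delay values are hypothetical, in picoseconds.

```python
def pipelined_period(t_skew, t_logic, t_flipflop, stages):
    """Clock period when the logic delay is split evenly across the stages."""
    return t_skew + t_logic / stages + t_flipflop


def speedup(t_skew, t_logic, t_flipflop, stages, operations):
    """Ratio of unpipelined to pipelined time for a batch of operations."""
    t_without = operations * (t_skew + t_logic + t_flipflop)
    # The first result needs `stages` cycles to emerge; each later result
    # follows one cycle behind the previous one.
    cycles_with = stages + operations - 1
    t_with = cycles_with * pipelined_period(t_skew, t_logic, t_flipflop, stages)
    return t_without / t_with


for m in (1, 2, 4):
    print(m, round(speedup(10.0, 60.0, 30.0, stages=m, operations=1000), 3))
```

Because the skew and flip-flop terms do not shrink as stages are added, the speedup in this model grows more slowly than the stage count, and a single operation (M = 1) is actually slower through the pipeline.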


Of course, this benefit is not without cost. There are 
several trade-offs involved such as increases in the number 
of components, power consumption, control complexity, chip 
area, and a variety of associated costs for design and 
fabrication. Additionally, the propagation latency for a 
single set of signals traveling through the pipeline is 
increased due to the additional delays contributed by the 
intermediate register(s) in the pipeline. Equation (2-4) 
expresses this increase in latency as a function of the 
number of pipeline stages (m) and the total register delay 
(Loomis, 2000). 


(2-4)   $\text{Latency}_{increase} = (m - 1)\, t_{register}$ 
Figure 2-2. Example of Pipelining (After Loomis). 


Though the significant increase in delay for a single 
operation may seem to be a tragic loss, it is the remarkable 
increase in data throughput which accompanies the increase 
in clock speed that ultimately motivates the designer to 
adopt a pipelined architecture. 

In the context of this project, a pipelined 
architecture will facilitate the achievement of high clock 


speeds in the implementation of a relatively large, complex 


combinational circuit — a combinational multiplier. 


C. LOGIC DESIGN OF A COMBINATIONAL MULTIPLIER 
A combinational multiplier takes two n-bit operands and 


performs n shift and n add operations to generate a 2n-bit 


product. Most algorithms are implemented based upon the 
paper-and-pencil-like procedure of shifted product 
components as shown in Figure (2-3). Each individual bit of 


the multiplier ($y_0$ through $y_{n-1}$) is successively multiplied 
by the entire n-bit multiplicand. With each subsequent 
multiplier bit, the resulting product component is shifted 
by one bit position, starting with an initial shift of zero 
and concluding with n-1. (Wakerly, 2000) 
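
The shifted product components described above can be generated in a few lines of Python. This is only a behavioral sketch of the paper-and-pencil procedure; the function name and the example operands are hypothetical.

```python
def partial_products(multiplicand, multiplier, n=8):
    """Return the n shifted product components of an n-bit multiplication."""
    components = []
    for i in range(n):
        y_i = (multiplier >> i) & 1                    # i-th multiplier bit
        components.append((multiplicand * y_i) << i)   # shifted by i positions
    return components


terms = partial_products(0b10110101, 0b01101110)
for i, term in enumerate(terms):
    print(f"y{i}: {term:016b}")
print(sum(terms) == 0b10110101 * 0b01101110)   # True
```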

The worst-case delay for this type of multiplication is 
governed by the carry propagation out of the most 
significant bit position and into the follow-on stage of 
addition. By utilizing carry-save addition (Figure 2-4), 


this propagation delay is eliminated for the initial n-1 
Figure 2-3. Multiplication as a sum of partial product 
terms (From Wakerly). 
Figure 2-4. An 8x8 bit multiplier implemented with seven 


carry-save adder stages and one ripple-carry adder for 
carry completion (From Wakerly). 




stages of addition; however, an extra stage is required to 
complete the addition of the final two resulting terms, as 
will be explained shortly. 

The first carry-save addition stage takes two binary 
addends and generates an n-bit modulo-two sum and a shifted 
n-bit carry term (shifted by one bit). Subsequent carry-save 
addition stages take three binary addends: the 
previous partial sum, the shifted carry term, and the next 
subsequent product term. These are also added to produce an 
n-bit modulo-two sum and a shifted n-bit carry term. As 
each carry-save addition occurs, the least significant bit 
(LSB) of each partial sum represents the next most 
significant bit (MSB) in the final product. This is 
repeated until the nth product term has been added, and all 
that remains are a sum term and a shifted carry term. At 
this point, a carry-completion adder computes the most 
significant n+1 bits of the product. This procedure 
accounts for the consecutive propagation of a carry bit as 
each pair of addend bits is summed from LSB to MSB. 
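
The carry-save procedure can be modeled behaviorally in Python. The sketch below is an illustration rather than the thesis circuit: it keeps full-width sum and carry words instead of peeling off one product bit per stage, and it uses Python's built-in addition to stand in for the carry-completion adder. The function names are hypothetical.

```python
def carry_save_add(a, b, c):
    """Reduce three addends to a modulo-two sum word and a shifted carry word."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word


def carry_save_multiply(x, y, n=8):
    """Multiply two n-bit unsigned integers using carry-save accumulation."""
    products = [(x << i) if (y >> i) & 1 else 0 for i in range(n)]
    # The first stage effectively has only two addends (the carry word is 0).
    sum_word, carry_word = products[0], 0
    for term in products[1:]:          # n - 1 carry-save stages
        sum_word, carry_word = carry_save_add(sum_word, carry_word, term)
    # Carry completion: the only step in which carries ripple across bits.
    return sum_word + carry_word


print(carry_save_multiply(181, 110) == 181 * 110)   # True
```

For n = 8 the loop runs seven times, matching the seven carry-save stages followed by a single carry-completion addition.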

In the context of this project, the implementation of 
carry-save adders and carry completion adders allows 
convenient grouping of pipeline stages. This is 
particularly applicable to the final stage of the design 
process undertaken in this project. Chapter V provides 


further details on the implementation of a pipelined 8x8-bit 
combinational multiplier, as introduced in the preceding 


paragraphs. 


D. BJT/HBT LOGIC 


1. BJT/HBT Principles and Characteristics 


a) Device Structure 

A bipolar junction transistor (BJT) is a sandwich 
structure of three separately doped regions of silicon (or 
other suitable semiconductor), such that one of two 
configurations exists. One configuration is the pnp 
transistor where a negatively doped region is bounded on 
either end by positively doped regions (p-type transistor). 
The other configuration is the npn transistor where a 
positively doped region is bounded on either end by 
negatively doped regions (n-type transistor). Figure (2-5) 
provides a simplified illustration and further identifies 
the proper names for the regions: collector, base, and 


emitter. 


Figure 2-5. Structure of a Bipolar Junction Transistor 
(After Pierret). [Regions labeled: Emitter, Base, Collector.] 


Until recent years, BJTs were generally fabricated 
from a single semiconductor material. However, device- 
level physics has demonstrated that faster junction 
transistors can be constructed from dissimilar semiconductor 
materials with complementary properties. Such devices are 
known as heterojunction bipolar transistors (HBTs). 
Conveniently enough, their operational behavior is 
essentially governed by the same functional principles as 
BJTs (Pierret). Therefore, it is assumed that 
wherever BJT behavior is referenced, a direct correspondence 
to HBT behavior exists. The following sections will provide 


a fundamental understanding of that behavior. 


b) Device Function 

The significance of the BJT lies in its potential 
to behave as a current-controlled current source when the 
proper DC bias is applied to the three regions or terminals. 
The controlling terminal is the base. Applying the proper 
DC bias to an npn transistor, a small current flowing into 
the base will produce a proportionately larger current being 
drawn into the collector, across the base region, and out of 
the emitter (Figure 2-6). The converse is true for a 
properly biased pnp transistor. A small current drawn out of 
the base will produce a proportionately larger current being 
drawn into the emitter, across the base region, and out of 


the collector. From this point forward, it will be helpful 
Figure 2-6. A functional illustration of an (a) npn and 
a (b) pnp bipolar junction transistor (After Sedra). 


to limit the discussion to npn transistors, because the pnp 
transistors operate in a very similar manner (with reversed 
polarity) and npn transistors are the only type encountered 
in the chapters ahead. 

As stipulated in the preceding discussion, proper 
DC bias conditions must exist in order to achieve the 
desired performance. Depending upon the DC bias, the 
transistor will operate in one of the following modes of 
operation: cutoff, active, or saturation. In the first 
case, the emitter-base junction is reverse biased, which 
implies that $V_{BE} < V_{BE(on)}$. This also 
implies that $V_{BC} < V_{BC(on)}$ for the collector-base junction. 
Therefore, the collector-base junction is also reverse 
biased. This condition is known as the "cutoff" mode since 


effectively no current flows through the transistor. 




In the two remaining modes, the emitter-base 
junction is forward biased, and the transistor conducts 


current. The mode of operation is distinguished by the 
condition of the collector-base junction — using the 


emitter as a common reference for both the collector and 


base. When $V_{CE} < V_{CE(sat)}$, then the base-collector junction is 
saturated, and the flow of current from collector to emitter 
is not linearly dependent on $I_B$. Conversely, when $V_{CE} > V_{CE(sat)}$ 
for the base-collector junction, then it is reverse biased 
and current is swept from the collector, across the base, 
and out of the emitter in linear proportion to the amount of 
base current applied. This is known as the active region. 

Table (2-1) summarizes the relationships which 
govern the three regions of operation. Furthermore, Figure 
(2-7) is an i-v curve for the Hughes InP HBT (1x1 micron). 
It serves to illustrate the active and saturation modes of 
BJT operation while also providing necessary design 
information that relates the base-emitter voltage drop ($V_{BE}$) 
to collector current levels ($I_C$). 


The linearly proportionate increase in collector 
current relative to base current is referred to as the 
common-emitter current gain, Beta ($\beta$), as shown in Equation 
(2-5). (Sedra) 

(2-5)   $\beta = \dfrac{I_C}{I_B}$ 


Mode of       Base-Emitter Junction              Collector-Emitter Junction 
Operation     Bias      Relationship             Bias      Relationship 
Cutoff        Reverse   $V_{BE} < V_{BE(on)}$    Reverse   - 
Saturation    Forward   $V_{BE} > V_{BE(on)}$    Forward   $V_{CE} < V_{CE(sat)}$ 
Active        Forward   $V_{BE} > V_{BE(on)}$    Reverse   $V_{CE} > V_{CE(sat)}$ 


Table 2-1. Relationships governing the operational regions 
of the BJT transistors (After Sedra). 
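
The three operating regions can be expressed as a small decision function. In this Python sketch the 0.75 V turn-on voltage is the value used in the DC analysis example of this chapter, while the 0.2 V saturation voltage and the function name are hypothetical placeholders rather than Hughes device data.

```python
def bjt_mode(v_be, v_ce, v_be_on=0.75, v_ce_sat=0.2):
    """Classify the operating mode of an npn transistor from its bias voltages."""
    if v_be < v_be_on:
        return "cutoff"        # emitter-base junction not forward biased
    if v_ce < v_ce_sat:
        return "saturation"    # collector current no longer tracks base current
    return "active"            # collector current is proportional to base current


print(bjt_mode(0.50, 5.0))   # cutoff
print(bjt_mode(0.80, 0.1))   # saturation
print(bjt_mode(0.80, 2.0))   # active
```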
Figure 2-7. I-V Curve for the InP HBT. 
Figure 2-8. Variation of Beta for the InP HBT with 
respect to $V_{BE}$ and $V_{CE}$. 


Beta is a device parameter for BJTs, a function of the 


device physics and dimensions. Figure (2-8) illustrates how 
Beta varies according to the values of base-emitter voltage 


and collector-emitter voltage. 


Finally, a simple application of Kirchhoff’s 
Current Law produces Equation (2-6) — an important 
relationship for current through the transistor. 


(2-6)   $I_E = I_B + I_C$ 
c) DC Analysis of a BJT Circuit 

In order to illustrate the basic concepts of BJT 
operation as presented in the previous section, the 
transistor circuit in Figure (2-9) is now examined. Given 
the reference voltages, the turn-on voltage for the emitter- 
base junction (0.75v), and Beta for the transistor, it is 
readily determined that $V_{BB} > V_{BE(on)}$ and therefore the 
emitter-base junction is forward biased. DC analysis 
reveals the value of $V_B$ and $I_B$. Applying the equations from 
the previous section, $I_C$, $I_E$, and $V_C$ are determined, and it is 


concluded that the transistor 1s operating in the active 


region. 


DC ANALYSIS: 

$V_E = 0\,\text{V}$ 
$V_B = V_E + V_{BE(on)}$ 
$I_B = \dfrac{V_{BB} - V_B}{R_B}$, with $R_B = 100\,\text{k}\Omega$ 
$I_C = \beta \times I_B = 4.3\,\text{mA}$ 
$I_E = I_B + I_C$ 
$V_C = V_{CC} - I_C R_C = 10\,\text{V} - (4.3\,\text{mA})\,R_C$ 





Figure 2-9. DC Analysis of a simple BJT circuit. 
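
The steps of this DC analysis can be retraced with a short Python sketch. The 100 kΩ base resistor, the 10 V collector supply, and the 0.75 V turn-on voltage come from the text; the base supply (5 V), Beta (100), and collector resistor (1 kΩ) are hypothetical values chosen so that the collector current lands near the 4.3 mA quoted in the figure. The sketch assumes a grounded emitter and active-mode operation.

```python
def dc_operating_point(v_bb, r_b, v_cc, r_c, beta, v_be_on=0.75, v_e=0.0):
    """Solve the DC operating point of a common-emitter npn stage in active mode."""
    v_b = v_e + v_be_on            # base sits one turn-on drop above the emitter
    i_b = (v_bb - v_b) / r_b       # base current through the base resistor
    i_c = beta * i_b               # common-emitter current gain
    i_e = i_b + i_c                # Kirchhoff's Current Law
    v_c = v_cc - i_c * r_c         # drop across the collector resistor
    return {"V_B": v_b, "I_B": i_b, "I_C": i_c, "I_E": i_e, "V_C": v_c}


# v_bb, beta, and r_c are hypothetical; r_b and v_cc follow the text.
point = dc_operating_point(v_bb=5.0, r_b=100e3, v_cc=10.0, r_c=1e3, beta=100.0)
for name, value in point.items():
    print(f"{name} = {value:.6g}")
```

With these assumed values the collector voltage remains well above the emitter, which is consistent with the conclusion that the transistor operates in the active region.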




In anticipation of logic applications, consider 
the base voltage as a logical input which is either high 
(above $V_{BE(on)}$) or low (below $V_{BE(on)}$). For a logic high input 
the transistor operates in the active mode, causing the 
voltage at the collector to drop below $V_{CC}$ by an amount equal to 
$I_C R_C$. Alternately, for a logic low input the transistor 
operates in the cutoff mode, drawing effectively no current 
through the collector and leaving $V_C$ approximately equal to 
$V_{CC}$. The functionality of this circuit is essentially that 
of a basic BJT inverter. 


d) BJT Differential Pair 

Before committing to the discussion of transistor 
logic circuits, it is necessary to introduce a configuration 
that maximizes the switching speed of the BJT transistor: 
the differential pair. A differential pair is constructed 
from two matched transistors ($Q_1$ and $Q_2$) with their emitters 
attached to a common current source and their collectors 
independently biased via separate pull-up resistors to a 
common voltage source, as shown in Figure (2-10). The base 
terminals are attached to separate voltage sources of equal 
value. Assuming the transistors have been given the proper 
DC bias for operation in the active mode, the relationship 


in Equation (2-7) is readily determined. 


(2-7)   $I_{E1} = I_{E2}$ 
Figure 2-10. Example of a BJT Differential 
Pair configuration. 


Now, consider the scenario where $V_{B2}$ is constant 
and $V_{B1}$ is allowed to vary between two extremes: one above 
and one below $V_{B2}$. When $V_{B1}$ reaches a voltage sufficiently 
larger than $V_{B2}$, all of the current from $I_{bias}$ is steered 
through $Q_1$ such that $Q_2$ is cutoff. Conversely, when $V_{B1}$ drops 
sufficiently below $V_{B2}$, $Q_2$ is on and $Q_1$ is cutoff. As noted 
in the DC analysis of the previous BJT circuit, the 
collector voltage of $Q_1$ exhibits the behavior of a logic 
inverter with respect to $V_{B1}$, while the opposite collector 
voltage ($Q_2$) functions as a non-inverting buffer. 




While the availability of complementary output 
voltages is certainly convenient, the most important 
observation of the differential pair is its switching speed. 
A relatively small voltage difference between $V_{B1}$ and $V_{B2}$ is 
required to switch the current almost entirely to the 
opposite path. More specifically, for a differential pair 
implemented with the Hughes InP HBT, it is shown in Figure 
(2-11) that a difference of only 75mV is sufficient to 


switch 90% of the current. 
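
For intuition, the current-steering behavior can be approximated with the idealized exponential (Ebers-Moll style) model of a matched bipolar pair. This Python sketch is not the Hughes device model used for the simulations: it ignores emitter resistance and other non-idealities, and the room-temperature thermal voltage of about 25.85 mV is an assumption.

```python
import math


def q1_current_fraction(v_diff, v_thermal=0.02585):
    """Fraction of the bias current carried by Q1 for a differential input (volts)."""
    return 1.0 / (1.0 + math.exp(-v_diff / v_thermal))


for millivolts in (0, 25, 50, 75, 100):
    share = q1_current_fraction(millivolts / 1000.0)
    print(f"{millivolts:3d} mV difference: {share:.1%} of the bias current in Q1")
```

This ideal model places roughly 95% of the current in one branch at a 75 mV difference, somewhat more than the 90% reported for the simulated device, which presumably reflects non-idealities that the sketch leaves out.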


Figure 2-11. Current Switching Characteristic of the InP 
HBT Differential Pair. [Plot of current (uA) versus input 
voltage (V) from 0.55 V to 1.00 V. Q1 = input voltage; Q2 = 
reference voltage (0.775v); bias current = 1mA. Annotation: 90% of 
the current has switched sides at a difference of 75mV.] 
Furthermore, since $Q_1$ and $Q_2$ are biased to operate 
in the active mode, the switching occurs faster than 
scenarios which may place the transistors in saturation 
mode. This is because a saturated transistor stores charge 
in its base. That charge must be dissipated before 
switching can occur. 

It is the current-steering property of the 
differential pair configuration which ultimately provides a 
foundation for the development of current mode logic, as 
will be discussed later in this chapter. However, before 
reaching that discussion, a brief overview of the dominant 
BJT logic families will serve to accentuate the advantage of 
current mode logic. 

2. BJT/HBT Logic Families 

This discussion is not intended to address all BJT/HBT 
logic families. Rather, the purpose here is to summarize the 
principles of the two most popular and relevant BJT/HBT 
logic families. These are transistor-transistor logic and 
current-mode logic. Ultimately, this discussion culminates 
with a comparison of the two logic families in order to 
justify the implementation of current-mode logic for high- 


speed applications. 


a) Transistor-Transistor Logic (TTL) 
Transistor-transistor logic evolved directly from 


diode-transistor logic (DTL) in a successful effort to 
eliminate the drawbacks of DTL. (Richards, 1967) While 
there were several stages in this evolution, the end product 
is a TTL family which resembles the inverter shown in Figure 
(2-12). The enhanced performance of TTL is predominately 
achieved through two fundamental design features. 

The first improvement is the use of a second 


transistor in place of the diodes of a DTL circuit. For a 


Figure 2-12. TTL Inverter. 


low input voltage, $Q_1$ is turned on, rapidly drawing 
current from the base of $Q_2$ and dissipating the excess 
charge to achieve a faster transition. In the opposite 
case, when the input is high and $Q_1$ is cutoff, $Q_1$ is 
specifically engineered to have a low reverse Beta such that 
a small yet sufficient current flows out through the 
collector and is applied to the base of $Q_2$. 

The second improvement is the use of an optimum 


output stage, commonly referred to as the "totem-pole" 
output stage (not shown in Figure 2-12). It combines 
the rapid high-to-low transition capability of the common- 
emitter output stage with the rapid low-to-high transition 
capability of the emitter-follower output stage. 

Based upon these two features in conjunction with 
other minor modifications, TTL logic achieved a level of 
popularity which made it the dominant design for SSI, MSI, 
and LSI circuits throughout two decades. Despite this 
success, standard TTL circuit speeds are still limited by 
two design issues. First, transistors operate in saturation 
mode which increases junction capacitance and its associated 
switching delay. Second, the resistance along the 
dissipation path for junction capacitance further increases 


this delay. 


b) Current-Mode Logic (CML) 

Current-mode logic is distinct from the design of 
other BJT/HBT logic families. The term "current-mode" 
refers to the channeling of a constant current along 
alternate paths to achieve logic functionality in circuits. 
Since it is the presence or absence of current that 
determines the logical output, the maximum voltage swing can 
be relatively small in contrast to voltage-mode circuits, 
such as TTL. 

The distinguishing design feature of current-mode 


logic circuits is the BJT differential pair. It is the 
backbone of all CML circuits and the source of critical 
advantages and disadvantages. The benefit of smaller logic 
swings has already been mentioned. Also, the discussion of 
the BJT differential pair earlier in this chapter explained 
how the collector voltage swings (inverts) rapidly in 
response to reversing the polarity/magnitude of the 
differential inputs by a narrow margin of approximately 
75 mV. This translates into a switching speed for CML which 
is unsurpassed by its predecessors. Contributing to this 
remarkable speed is the fact that the transistors of the 
differential pair can be operated in the active region and, 
therefore, do not suffer from the effects of excess charge 
stored at the transistor base. Unfortunately, the constant 
flow of current which enables these remarkable switching 
speeds also consumes a remarkable amount of power. 

For an illustration of how a CML circuit 
functions, consider the inverter in Figure (2-13). Let 
input B have a constant value, a reference voltage. When 
input A is high (greater than the reference voltage by at 
least 75 mV), then $Q_1$ is turned on and $Q_2$ is cut off. The 
current being drawn through $R_1$ produces a logic low ($V_{CC} - I_C R_1$) 
at $V_{out1}$. Notably, the complement of this output, a logic 
high ($V_{CC}$), is simultaneously available at $V_{out2}$. The presence 
of complementary outputs is yet another benefit of CML 
circuits. When input A is switched from high to low, the 
Figure 2-13. CML Inverter. 


conditions for Q, and Q, reverse. O, tumnston and oO ero euc 


Ome iV 


out2 


1s pulled low while V 1s pueden gh: 


outl 
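The current steering described above can be illustrated with a minimal sketch. It assumes the textbook exponential model of an ideal, matched bipolar differential pair and a thermal voltage of about 25.85 mV (room temperature). Neither assumption is taken from the HBT device models used in this thesis, and the function name is hypothetical.

```python
import math


def steering_fraction(v_diff_mv, v_thermal_mv=25.85):
    """Fraction of the bias current carried by the input-side transistor
    of an ideal differential pair, for a differential input in millivolts.

    Assumption: ideal exponential junction model with matched devices.
    """
    return 1.0 / (1.0 + math.exp(-v_diff_mv / v_thermal_mv))


# Sweep a few differential input voltages around the reference.
for v in (0.0, 25.0, 75.0, 125.0):
    print(f"{v:6.1f} mV -> {steering_fraction(v):.3f} of I_bias")
```

Under these assumptions, a 75 mV differential input steers roughly 95% of the bias current to one side, which is in line with the narrow switching margin cited above.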


c) Advantages and Disadvantages 

For high-speed applications, the selection of a 
BJT logic design is reduced to a quantitative comparison of 
TTL and CML. The predecessors of these two logic families
are far inferior in their capability to dissipate the 
accumulated charge at the transistor base upon switching. 

If the only two criteria were maximizing speed
while minimizing power consumption, then there could
possibly be a toss-up between TTL and CML, ultimately to
be determined by the design which achieves the lowest
power-delay product or by weighting one specification over
the other (high-speed or low-power). Clearly, TTL is the
low-power contender, while CML is the high-speed champion.
However, before addressing the issue in the context of this 
design project, consider the following summary of advantages 
and disadvantages. 

In addition to being faster, CML requires a
smaller voltage swing than TTL and is less susceptible to
noise due to the nature of the BJT differential pair. As
another benefit of that nature, CML generates complementary
outputs. The fact that both output signals are referenced
to V_CC provides for exceptional stability when V_CC is
referenced to ground and a negative supply voltage is used.
Unfortunately for TTL, its strong point of consuming less 
power has a down side: the short pulses of current which 
must be generated for switching logic levels also create 
spikes in the supply voltage. The constant current drawn by 
CML circuits avoids this potential source of noise. 

In conclusion to this comparison, a logic designer 
presented with the choice of CML or TTL would only choose 
TTL in the event that power consumption made CML 
impractical. In real world applications, this is typically 
true. However, since it is the purpose of this design
project to explore the impact of high-speed logic on digital
system architecture, priority has been given to the superior
speed and extensive design benefits of CML.

Having concluded that current-mode logic is the
best approach to HBT high-speed logic design, it is
necessary to design a sufficient set of logic gates to
implement the desired test circuit, an 8x8 bit pipelined
multiplier. Chapter III presents the discussion of logic
circuit design which includes design of the following: an
inverter/buffer gate, a NOR/OR gate, full adders, and a
practical current source.

III. HBT CML LOGIC CIRCUIT DESIGN


A. DESIGN OVERVIEW 

In this chapter, CML logic circuits are designed which
will serve as the building blocks for construction of the
multiplier logic. The design process is presented in the
context of a single logic circuit, beginning with the most
fundamental functions and progressing toward the more
complex. Of note are the following general design goals
which served as guidance for decision-making in the early
stages of logic circuit design:

• Minimize the rail voltages (i.e. supply voltage)

• Achieve proper DC bias conditions with reliable
  noise margins and fanout

• Optimize transient performance for speed and power
  consumption


B. INVERTER DESIGN 

1. Circuit Topology

Based upon the introduction to CML design in the
previous chapter, Figure (3-1) illustrates the circuit
topology of a CML inverter. A detailed description of its
function is presented in the previous chapter and will not
be repeated here. However, there is one subtle constraint
in this design. One of the differential inputs is tied to a
reference voltage. While this is not essential for the
design of an inverter, it will prove significant in the
implementation of multiple-input logic gates. A common
reference voltage eliminates the need to provide
complementary logic signals for each input and furthermore,
it avoids the increase in supply voltage associated with
multiple complementary inputs in a stacked series of
differential input pairs.

Figure 3-1. CML Inverter.

Figure (3-2) illustrates the same inverter design as
Figure (3-1); however, it also includes an emitter-follower
stage at each collector output of the differential pair.
The purpose of this stage is twofold. First, it provides a
buffer between the input differential pair and the
capacitive load of subsequent driven logic gates. Second,
it produces a downward DC shift equal to the base-emitter
turn-on voltage. Ideally, the gain of the emitter-follower
is one; however, in practice the gain is slightly less than
one. The result is a slightly diminished voltage swing at
the output of the emitter-follower when compared to the
voltage swing at the collector of the differential pair.
Whether or not to include the buffer stage represents a
fundamental design issue for CML logic circuit design. At a
glance, performance arguments can be made both for and
against it. On the one hand, it would appear to increase
fanout performance, yet on the other, it would appear to
decrease switching performance with the additional switching
delay of a second transistor stage. Additionally, the
non-buffered output topology would consume less power for a
given bias current. However, without performance data to
substantiate one option over the other, both will be
developed and evaluated until objective design
considerations can identify a clear preference.

Figure 3-2. CML Inverter with output buffer stages.


2. Initial Conditions and Design Parameters


a) Voltage Parameters 

Having introduced the topology of the CML
inverter, it is necessary to establish initial conditions
for operation. The first is the supply voltage, which is
bound by two primary considerations. It must be large
enough to support the proper function of the circuit, i.e.
provide proper transistor bias conditions and the desired
voltage range between high and low logic levels.
Conversely, it should be kept as small as possible, because
the power consumed by the circuit is directly proportional
to the magnitude of the supply voltage.

Clearly, foresight must be exercised in order to
determine the minimum supply voltage necessary to achieve
proper DC bias conditions for all transistors in all
circuits of the design. In the context of this project, the
D-type latch design (presented in Chapter IV) imposes the
greatest demand on the supply voltage level by operating
three transistors in series between the voltage supply
rails. For optimum, reliable clocking performance of the
latch, the logic reference voltage is determined to be 1.45
volts. This figure is based upon a maximum logic signal
range of 0.5 volts and a maximum logic high voltage of 1.7
volts (reference Chapter IV-A-3a for further details).

Given this information, the minimum required
supply voltage is determined for each inverter topology.
Both require that the voltage at the collector (V_C) be large
enough to avoid saturation of Q1. Furthermore, both require
that the voltage at the collector provide for an output
voltage that matches the range of the input voltage.

For the non-buffered topology, this implies an
inverse match between the voltage at the base of Q1 and the
voltage at its collector. In other words, for a logic input
that is high, V_B(hi), the output voltage at the collector
should be low, such that the following relationship in
Equation (3-1) holds true.


(3-1)    $V_{C(low)} = V_{B(hi)} - 0.5\,\text{V}$


Assuming the collector of Q1 draws approximately 1mA of
current, the collector-emitter saturation voltage, V_CE(sat),
is 0.275 volts and the base-emitter turn-on voltage is 0.775
volts. Under these conditions, Q1 is on the boundary of
active mode operation. For a signal swing larger than 0.5
volts, the transistor would saturate. Conversely, for a
logic input (V_B) that is low, the collector voltage (V_C) must
be given by Equation (3-2).

(3-2)    $V_{C(hi)} = V_{B(low)} + 0.5\,\text{V}$

For V_B(low) equal to 1.2 volts, V_C(hi) must be 1.7 volts.
Thus, for the non-buffered topology, the maximum voltage at
the collector is 1.7 volts. No current flows through R_gain
because Q1 is cutoff; therefore, the minimum required supply
voltage is also 1.7 volts.

In the case of the buffered topology, the DC
voltage drop across the base-emitter junction of the output
buffer imposes a greater demand. For the output voltage
range to match the input voltage range, the voltage at the
collector (as described in Equation 3-2) must be increased
by an amount of V_BE(on) (as shown in Equation 3-3) in order
to counter the base-emitter voltage drop at the buffered
output.

(3-3)    $V_{C(hi)} = V_{B(low)} + 0.5\,\text{V} + V_{BE(on)}$


Assuming a current of 1mA or less through the buffer,
V_BE(on) is approximately 0.8 volts. The result is a minimum
required supply voltage of 2.5 volts. (Reference Chapter
IV-A-3a for a thorough derivation of these conclusions.)
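The supply voltage arithmetic above can be retraced with a short sketch. The numeric values are those quoted in the text; the helper names are hypothetical, and the sketch assumes that the 0.5 volt swing is centered on the 1.45 volt reference and that the collector low level equals the logic low level.

```python
# Values quoted in the text; the helper names are hypothetical.
V_REF = 1.45      # logic reference voltage (V)
V_SWING = 0.5     # logic signal range (V)
V_BE_ON = 0.775   # base-emitter turn-on voltage near 1 mA (V)
V_CE_SAT = 0.275  # collector-emitter saturation voltage near 1 mA (V)


def logic_levels(v_ref=V_REF, v_swing=V_SWING):
    """Return (V_low, V_high), assuming the swing is centered on V_ref."""
    return v_ref - v_swing / 2.0, v_ref + v_swing / 2.0


def min_supply(buffered, v_be_on=V_BE_ON):
    """Minimum supply voltage: the highest collector voltage required."""
    v_low, _ = logic_levels()
    v_c_hi = v_low + V_SWING        # collector high level, Equation (3-2)
    if buffered:
        v_c_hi += v_be_on           # emitter-follower drop, Equation (3-3)
    return v_c_hi


def v_ce_with_high_input():
    """Collector-emitter voltage of Q1 when its base is at logic high."""
    v_low, v_high = logic_levels()
    v_emitter = v_high - V_BE_ON    # emitter sits one V_BE below the base
    return v_low - v_emitter        # collector sits at the logic low level


print(f"non-buffered supply:      {min_supply(False):.3f} V")
print(f"buffered supply:          {min_supply(True):.3f} V")
print(f"V_CE of Q1 (input high):  {v_ce_with_high_input():.3f} V")
```

With the 0.775 volt turn-on voltage, the buffered case evaluates to 2.475 volts, just under the 2.5 volt figure adopted in the text, and V_CE of Q1 equals the 0.275 volt saturation voltage, which is the boundary of active mode operation.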

In summary, different supply voltage levels will
be utilized for the two inverter topologies. The
non-buffered output topology will employ a 1.7 volt supply
voltage, while the buffered output topology will employ a
2.5 volt supply voltage.

b) Transistor Area/Size

In order to optimize switching speeds in BJT/HBT
transistors, it is desirable to keep the device area small,
thereby minimizing parasitic capacitances. Likewise, a
smaller device size requires less current and less current
means less power. The InP HBT device sizes made available
from Hughes Research Laboratories have junction areas of
1x1, 1x3, 1x5, and 2x5 microns. The 1x1 area transistor is,
therefore, the transistor of choice for switching
applications (logic circuits). Note, however, that the
consideration of device size must be re-visited for
applications where switching speed is not a factor, i.e. the
construction of a practical current source (addressed in
Chapter IV).


c) Fanout Requirement 

Fanout is the number of logic gate inputs that a
single gate output can drive, while providing voltage levels
within the correct logic range. Increased fanout is
achieved at the expense of power consumption and loss of
speed. Considering that the CML logic inputs/loads are
current-driven, increased fanout will require a
corresponding increase in switching delay and/or current.
As a result, the fanout parameter should be chosen such that
it sufficiently economizes the number of logic gates and
levels of logic required without needlessly sacrificing
power and speed. In meeting this requirement, a reasonable
fanout parameter has been established based upon the
logic-level design of a three-input adder (reference Chapter
III-D). For implementation using the minimum number of
logic levels, a three-input adder requires a fanout of four.


3. DC Analysis


a) Overview 
Given the circuit topology for a CML inverter as
shown previously in Figure (3-2), the first step in circuit
design is to establish the proper DC bias conditions for
operation. This can be done for both the buffered and
non-buffered cases simultaneously. For the non-buffered
case, simply disregard the presence of the buffer stages.
The remaining node voltages at the collector outputs on the
differential pair are the same.

Figures (3-3a) and (3-3b) show the DC node
voltages for the desired operation of a CML inverter given a
high logic input and a low logic input, respectively. Given
matched transistors, the two sides of the differential pair
could be considered symmetric in their behavior, except that
the input voltages driving the opposite sides of the
differential pair are not symmetric. That is, the reference
voltage drives the differential pair at 1.45 volts whereas
the logic input drives at 1.7 volts. The result is a
difference of 0.25 volts at the emitter. This is a minor
observation at present, but it explains the non-symmetric
performance that is encountered between the two output
signals (the inverted and the non-inverted signals).

Figure 3-3. DC Analysis of a CML Inverter for (a) a HIGH
input logic level and (b) a LOW input logic level.
(Note: leakage current is neglected.)


b) Gain Resistor 
In order to take advantage of the switching speed
of the differential pair, transistors must be biased to
operate in the active mode. Therefore, the value of the
base-emitter voltages (V_BE) for Q1 and Q2 must be such that
neither transistor saturates. Thus, for a given supply
voltage and bias current, there is a restriction on the
magnitude of the voltage drop across R_gain. If the drop is
too large, the transistor will saturate. Conversely, the
voltage drop must not be too small because it is the product
of I_R-gain and R_gain which determines the magnitude of the
signal voltage swing (assuming active operation). This same
voltage range applies to the output of the buffer stages as
well. As referenced earlier in this chapter, a constant DC
shift of V_BE(on) is the only difference between the nodes
V_C and V_out.

In summary, the significance of R_gain is two-fold:
it must be small enough to keep Q1 (and Q2) operating in the
active mode, and it must be large enough to provide a
satisfactory voltage swing between logic levels. Figure
(3-4) illustrates the DC transfer characteristic of the
inverter for various values of gain resistance. It
effectively demonstrates the upper and lower limitations of
gain resistance for a value of I_bias equal to 1mA. At
resistances of 500 ohms and less, the desired 0.5 volt
signal swing is not achieved, and at resistances of 600 ohms
and greater, the effect of saturation can be observed by the
upward bend in the curve.

Figure 3-4. Effect of Gain Resistor Variation on
Inverter Output. (Output Voltage (V) versus Input Voltage
(V); Bias Current = 1mA, Reference Voltage = 1.45v, Buffer
Resistance = 1k ohms.)


c) Buffer Resistor 

The buffer resistor (R_out) governs the amount of
current drawn by the emitter of the output buffer. The
magnitude of emitter current is directly proportional to the
base current which is drawn from the collector of the
differential pair. Thus, the base current of the output
buffer represents a small portion of the current passing
through R_gain. In this way, the size of the buffer resistor
effectively produces a small DC offset at the buffered
output while regulating the amount of current drawn through
the buffer stage.

This is significant for two reasons. First, it
facilitates optimization of switching speed versus power
consumption by providing a mechanism for controlling the
amount of current flowing through the buffer stage and
therefore, available to drive a logic load. Second, R_out is
inversely proportional to a DC voltage offset at the
buffered output. The ability to control this offset is
especially helpful in matching the output signal swing to
the input. Figure (3-5) represents the variation of output
voltage for a range of resistor values based upon a bias
current of 1mA.


d) Bias Current 

Bias current is directly proportional to the
current (I_C) drawn through the gain resistor, R_gain.
Therefore, bias current drives the magnitude of the voltage
drop produced in the gain resistor, and this voltage drop
corresponds to the maximum signal voltage swing. For this
reason, a proper combination of I_bias and R_gain must be
determined to provide the desired 0.5 volt swing. In order
to select from an infinite set of current-resistor
combinations, a likely set of current-resistor pairs will be
identified to represent the practical range of
possibilities. This is done for both the buffered and
non-buffered inverter topologies. Note, the non-buffered
topology can be allowed to draw a higher bias current
through the differential pair because it does not draw any
additional current through buffer stages.

Figure 3-5. Effect of Buffer Resistor Variation on
Inverter Output. (Output Voltage (V) versus Input Voltage
(V); Bias Current = 1mA, Reference Voltage = 1.45v, Gain
Resistance = 600 ohms.)
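As a first-order illustration of how bias current and gain resistance pair up, the sketch below assumes that the full bias current flows through one gain resistor, so that the swing is simply the product of the two. It ignores the base current drawn by the buffer and the less-than-unity gain of the emitter-follower, so the values are starting estimates rather than the resistances selected in this design; the function name is hypothetical.

```python
def first_order_r_gain(i_bias_ma, v_swing=0.5):
    """Gain resistance in ohms that yields v_swing volts at the collector.

    Assumption: the whole bias current (in mA) flows through the gain
    resistor, with no loading by the output buffer.
    """
    return v_swing / i_bias_ma * 1000.0  # V / mA = kilohms, scaled to ohms


# Bias currents in the range examined in this chapter.
for i_bias in (0.1, 0.25, 0.5, 0.75, 1.0, 1.5):
    print(f"I_bias = {i_bias:4.2f} mA -> R_gain ~ {first_order_r_gain(i_bias):7.1f} ohms")
```

At 1mA the estimate is 500 ohms, which matches the lower limit noted for Figure (3-4). At 0.75mA it is about 667 ohms, somewhat below the 750 ohms listed for the final inverter design, as expected for an estimate that neglects buffer loading.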


e) DC Noise Margins 

Once values of resistance and bias current are 
established, the circuit topology is completely defined and 
a DC transfer curve can be obtained. From this plot the DC 
noise margins for a particular design are calculated. Noise 


margins provide a measure of the allowable noise which can
be received at the input without affecting the correct logic
level output. Since this circuit will be operating with such
a narrow signal voltage swing, noise margins are a critical
interest for establishing reliable DC bias conditions.
Equations (3-4) and (3-5) define the high and low noise
margins in terms of the maximum and minimum, high and low
logic values. (Weste, 1993)

(3-4)    $NM_H = V_{OHmin} - V_{IHmin}$
(3-5)    $NM_L = V_{ILmax} - V_{OLmax}$

where,
    $V_{IHmin}$ = minimum HIGH input voltage
    $V_{ILmax}$ = maximum LOW input voltage
    $V_{OHmin}$ = minimum HIGH output voltage
    $V_{OLmax}$ = maximum LOW output voltage


These logic values are extracted from the DC transfer curve. 
The two unity gain points (where the slope equals negative 
one) of the DC transfer curve have been used to define the 


boundaries of these regions. 
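The extraction of noise margins from a transfer curve can be sketched as follows. The sketch uses the conventional definitions NM_H = V_OHmin - V_IHmin and NM_L = V_ILmax - V_OLmax, locates the two unity gain points on a sampled curve, and is exercised on a synthetic tanh-shaped characteristic. That curve and its 0.05 volt steepness parameter are assumptions chosen for illustration only; they are not simulation data from this design.

```python
import math


def noise_margins(vin, vout):
    """Return (NM_H, NM_L) for a sampled inverting DC transfer curve.

    The logic thresholds are taken at the two unity gain points, i.e.
    the first and last samples bounding the region where slope <= -1.
    """
    slopes = [(vout[i + 1] - vout[i]) / (vin[i + 1] - vin[i])
              for i in range(len(vin) - 1)]
    steep = [i for i, s in enumerate(slopes) if s <= -1.0]
    if not steep:
        raise ValueError("transfer curve never reaches a slope of -1")
    first, last = steep[0], steep[-1] + 1
    v_il_max, v_oh_min = vin[first], vout[first]   # first unity gain point
    v_ih_min, v_ol_max = vin[last], vout[last]     # second unity gain point
    return v_oh_min - v_ih_min, v_il_max - v_ol_max


# Synthetic curve: 0.5 V swing centered on a 1.45 V reference (assumed shape).
vin = [i * 0.001 for i in range(2501)]
vout = [1.45 - 0.25 * math.tanh((v - 1.45) / 0.05) for v in vin]
nm_h, nm_l = noise_margins(vin, vout)
print(f"NM_H = {nm_h:.3f} V, NM_L = {nm_l:.3f} V")
```

Because the synthetic curve is symmetric about the reference, the two margins come out equal; an actual inverter driven against a fixed reference would generally show some imbalance.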


f) DC Bias Optimization

Given a set of practical current values, DC
analysis is employed to identify a set of matching gain
resistances which properly bias the inverter for logic
operations. For each pair of current-resistor values, a DC
transfer characteristic is obtained to determine the noise
margins and the maximum range of the signal swing. The
results are tabulated in Table (3-1). In the absence of a
load, each configuration met the established design
requirements: a matched input and output signal
voltage range of 0.5 volts, centered at a reference voltage
of 1.45 volts with sufficiently balanced noise margins of
0.1 volt minimum (20% of the signal range).

However, when examined under the maximum fanout 
load (which is four), the performance of the non-buffered 
output topology suffers greatly. The maximum high logic 
voltage is reduced by an amount ranging from 0.09 volt to 
0.23 volt, depending upon the bias configuration. Not only 
does a load reduce the desired 0.5 volt signal range, but it 
also erodes the high-end noise margin. As a result, the non- 
buffered output topology can now be eliminated from further 
consideration in the design process. 


As for the buffered output topology, the noise
margins and voltage range are remarkably consistent,
regardless of the loading. The output buffer effectively
isolates the current drawn by the load from the current in
the differential pair. Thus, each of the bias
configurations for the buffered output topology will be
further tested under transient conditions to identify the
optimum inverter design. It should be noted that the DC
analysis presented here and the transient performance
analysis which follows are both conducted using ideal
current source models.

4. AC/Transient Analysis


a) Delay Measurements 

Transient performance of logic circuits is
generally quantified by measuring the delay associated with
signal propagation. The delay times utilized here are
standard performance parameters. However, for completeness,
their mathematical definitions are provided below in
Equations (3-6) and (3-7). (Weste, 1993)
(3-6)    $t_{rise}$ = time for a logic signal to traverse
         from $0.1\,V_{range}$ to $0.9\,V_{range}$
(3-7)    $t_{fall}$ = time for a logic signal to traverse
         from $0.9\,V_{range}$ to $0.1\,V_{range}$
where,   $V_{range}$ = the voltage difference between the
         steady-state logic high and logic low levels
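A transition time of this kind can be measured from a sampled waveform as sketched below. The sketch assumes the conventional 10% and 90% thresholds of the steady-state range and linear interpolation between samples; the function names and the example ramp are hypothetical.

```python
def crossing_time(t, v, level):
    """First time at which the piecewise-linear waveform crosses `level`."""
    for i in range(len(t) - 1):
        lo, hi = v[i], v[i + 1]
        if lo != hi and (lo - level) * (hi - level) <= 0:
            frac = (level - lo) / (hi - lo)
            return t[i] + frac * (t[i + 1] - t[i])
    raise ValueError("waveform never crosses the requested level")


def transition_time(t, v, v_low, v_high):
    """Time spent between the 10% and 90% points of the logic range.

    Works for rising and falling edges alike, since only the absolute
    time difference between the two crossings is returned.
    """
    v_range = v_high - v_low
    t_10 = crossing_time(t, v, v_low + 0.1 * v_range)
    t_90 = crossing_time(t, v, v_low + 0.9 * v_range)
    return abs(t_90 - t_10)


# Example: an ideal 50 ps ramp across the 1.2 V to 1.7 V logic range.
t_ps = [0.0, 50.0]
rising = [1.2, 1.7]
print(f"rise time: {transition_time(t_ps, rising, 1.2, 1.7):.1f} ps")
```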


b) Performance Parameters 

At this point in the design process, two
performance parameters are of primary concern, power and
speed. Being related to each other, there is often a
trade-off between the two. Optimization of these two
parameters will determine which of the DC bias inverter
configurations will be implemented. A common method of
optimization is to quantify the parameters of power and
speed as a single figure of merit, such as a product or a
ratio. Optimization is then achieved by maximizing or
minimizing the appropriate figure of merit.

Power-delay product is one such figure of merit.
It is simply the power consumed by a logic circuit
multiplied by the propagation delay of the signal from
input to output. Expectedly, the design that most
efficiently balances the trade-off between speed and power
consumption will yield the lowest power-delay product in
transient testing.

The ratio of speed to power provides a similar
figure of merit, but speed measurements are not as clearly
defined as delay measurements. Therefore, in the interest
of optimizing this design for speed, a definition of maximum
switching frequency will now be established. The maximum
reliable frequency is defined as the maximum switching
frequency of the logic input signal for which a maximally
loaded output signal consistently traverses 90% of the 0.5
volt range of logic.


c) Transient Analysis Procedures 

For an accurate evaluation of logic circuit
performance, it is necessary to provide a realistic input
signal and a worst-case output load. Here, the term load
implies driving four inverters in parallel. To achieve a
realistic test environment, the test circuit of Figure (3-6)
was designed. Specifically, note the location of gates A
and B. Their input and output signals will be measured to
analyze performance with a fanout of one and of four,
respectively.

Figure 3-6. Test Circuit for Transient Analysis.


It is expected that the use of a reference voltage
at the differential input of the inverter will cause the
inverted and non-inverted output signals to respond
differently. As a result, two gate topologies are analyzed
for each of the valid DC bias configurations from Table
(3-1). The first gate topology is a single output inverter
from which the inverted output signal is measured. The
second is a complementary output inverter from which the
non-inverted output signal is measured. Conveniently, these
two configurations also represent the alternating signal
pattern which will characterize the adder circuits later in
this chapter.

Initially, the appropriate logic delays are
measured at gate A and gate B in order to collect data for
the cases of minimum and maximum loads, respectively. The
worst-case delay is then multiplied by the average power per
gate to obtain a power-delay product. This is done for both
the inverted and the non-inverted output signals, providing
separate power-delay product terms. Their sum forms a
composite power-delay product. The composite power-delay
product is a figure of merit which effectively represents
the implementation of the two gate topologies in series.
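The composite figure of merit can be reproduced with a few lines, sketched here for the 0.75mA configuration. The delays are the fanout-of-four values reported for that configuration, and the power figures are the single-output (3.99 mW) and complementary-output (5.78 mW) values listed in Table (3-4); the function name is hypothetical.

```python
def power_delay_product(power_mw, t_lh_ps, t_hl_ps):
    """Worst-case propagation delay (ps) times average power per gate (mW)."""
    return power_mw * max(t_lh_ps, t_hl_ps)


# 0.75 mA configuration, fanout of four.
inverted = power_delay_product(3.99, t_lh_ps=23, t_hl_ps=26)      # single output
non_inverted = power_delay_product(5.78, t_lh_ps=23, t_hl_ps=46)  # complementary output
composite = inverted + non_inverted

print(f"inverted:     {inverted:6.1f} mW-ps")
print(f"non-inverted: {non_inverted:6.1f} mW-ps")
print(f"composite:    {composite:6.1f} mW-ps")
```

The two terms round to 104 and 266 mW-ps, which agrees with the 0.75mA rows of Tables (3-2a) and (3-2b).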

Finally, the switching period of the input logic 
is decremented for successive tests in order to determine 
the shortest period for which the output signal of a loaded 
gate (gate B) would consistently traverse the full range of 
logic (between high and low). This quantity has been 
defined in the previous section as the maximum reliable 
frequency (MRF). For each configuration, the maximum 
reliable frequency is divided by the average power per gate 
to obtain a speed-power ratio (GHz/mW). The presence of a 
secondary load provides confirmation that consecutive loads 
can be successfully driven when the primary load is driven 


at its maximum reliable frequency.

d) Summary of Results

Transient analysis confirms the non-symmetric
behavior of the inverted and non-inverted output signals.
Therefore, Tables (3-2a) and (3-2b) provide details of their
respective delay measurements. Specifically, the
high-to-low transition of the non-inverted output signal
represents the worst-case transition.

Bias      Tprop   Tprop   Current    Power      Maximum
Current   L-H     H-L     per Gate   per Gate   Power-Delay
(mA)      (pS)    (pS)    (mA)       (mW)       Product
                                                (mW-pS)
0.1       42      255     0.81       2.03       518
0.25      56      48      0.97       2.42       136
0.5       33      26      1.28       3.20       106
0.75      23      26      1.59       3.99       104
1         [?]     26      1.88       4.69       122
1.5       13      27      2.38       5.94       160

Table 3-2a. Power-Delay Data for the Inverted Signal.
Single output topology with practical current sources and a
fanout load of four.

Bias      Tprop   Tprop   Current    Power      Maximum
Current   L-H     H-L     per Gate   per Gate   Power-Delay
(mA)      (pS)    (pS)    (mA)       (mW)       Product
                                                (mW-pS)
0.1       212     82      1.45       3.63       770
0.25      61      88      1.64       4.10       361
0.5       [?]     63      2.02       5.04       318
0.75      23      46      2.31       5.78       266
1         19      41      2.63       6.56       269
1.5       18      40      3.09       7.74       309

Table 3-2b. Power-Delay Data for the Non-Inverted Signal.
Complementary output topology with practical current
sources and a fanout load of four.

The overall performance of each DC bias 
configuration is summarized in Table (3-3). The power-delay 
product and speed-power ratio are normalized to simplify 
comparison. Figure (3-7) illustrates the minimization curve 
for the power-delay product, while Figure (3-8) shows the 
maximization curve for the speed-power ratio. 


Clearly, the 0.75mA configuration proves to be the
optimum design, maximizing the speed-power ratio while
minimizing the power-delay product. Furthermore, it
provides for a maximum reliable frequency of 8.7 GHz. This
is more than suitable to achieve the 5 GHz maximum clock
frequency desired in Chapter V (for the maximally pipelined
multiplier implementation).
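The speed-power comparison can be checked with a brief sketch. It pairs the maximum reliable frequencies from Table (3-3) with the complementary-output power per gate from Table (3-2b), using 5.78 mW for the 0.75mA case as listed in Table (3-4). That pairing is an assumption, adopted because it reproduces the normalized ratios in Table (3-3); the function name is hypothetical.

```python
def speed_power_ratio(mrf_ghz, power_mw):
    """Maximum reliable frequency divided by average power per gate (GHz/mW)."""
    return mrf_ghz / power_mw


# bias current (mA) -> (maximum reliable frequency in GHz, power per gate in mW)
configs = {
    0.5: (7.10, 5.04),
    0.75: (8.70, 5.78),
    1.0: (9.09, 6.56),
}

ratios = {i: speed_power_ratio(f, p) for i, (f, p) in configs.items()}
best = max(ratios.values())
normalized = {i: r / best for i, r in ratios.items()}

for i in sorted(normalized):
    print(f"{i:4.2f} mA: {ratios[i]:.3f} GHz/mW, normalized {normalized[i]:.2f}")
```

Under that assumption the 0.75mA configuration has the highest ratio, with the 0.5mA and 1mA cases at about 0.94 and 0.92 of it.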


Bias      Maximum       Normalized    Maximum     Normalized
Current   Composite     Composite     Reliable    Speed-Power
(mA)      Power-Delay   Power-Delay   Frequency   Ratio
          Product       Product       (GHz)
0.1       467           3.48          n/a         n/a
0.25      144           1.34          [?]         0.86
0.5       96            1.14          7.10        0.94
0.75      72            1.00          8.70        1.00
1         67            1.06          9.09        0.92
1.5       67            1.27          11.10       0.96

Table 3-3. Summary of Transient Analysis Results.
Composite Power-Delay Product and Speed-Power Ratio.


Figure 3-7. Results of Transient Analysis:
Normalized Speed-Power Ratio of Inverter Configurations.
(Normalized Speed-Power Ratio versus Bias Current (mA).)

Figure 3-8. Results of Transient Analysis:
Normalized Power-Delay Product of Inverter Configurations.
(Normalized Power-Delay Product versus Bias Current (mA).)

5. Final Design Summary: Inverter

The final design for the CML inverter/buffer circuit is
illustrated in Figure (3-9). The applicable design and
performance parameters have been summarized in Table (3-4).
Here, the data represents performance when the design is
implemented with the 0.75mA practical current source from
Chapter III-E. Also note that when complementary output
signals are not required, the unused output buffer stage can
be excluded to conserve power and minimize the device count.


CML Inverter
Design and Performance Parameters

R_gain:   750 ohms
R_out:    2000 ohms
I_bias:   0.75mA
NM_H:     0.13v (26% V_swing)
NM_L:     0.14v (28% V_swing)
Power:    5.78 mW (complementary output)
          3.99 mW (single output)

             Inverted Signal           Non-inverted Signal
Delays       Fanout = 1   Fanout = 4   Fanout = 1   Fanout = 4
t_p(H-L)     14ps         26ps         39ps         46ps
t_p(L-H)     17ps         23ps         18ps         23ps
t_fall       19ps         41ps         87ps         90ps
t_rise       48ps         61ps         45ps         60ps

Table 3-4. CML Inverter Design and Performance Parameters.

Figure 3-9. Final Design of the CML Inverter.
(Supply voltage of 2.5 volts; HBT 1x1 devices.)


C. LOGIC NOR GATE DESIGN 

1. Overview and Analysis

The circuit topology for a two-input CML NOR gate is
presented in Figure (3-10). There is little that differs
from the inverter, which accurately suggests that the
analysis here will be extremely similar to the previous
section. In fact, with regard to both circuit topology and
performance analysis, the only distinguishing feature is the
second logic input in parallel with the first.

Consider the functionality of the two parallel inputs A
and B. If either of them is a logic high, then the left
side of the differential pair is on and the NOR output is
pulled low. Conversely, if both inputs A and B are low,
then the NOR output is high. On the opposite side of the
differential pair is the complementary output, the OR
output. If another input transistor were added in
parallel to the existing two, it would be a three-input
OR/NOR gate, and similarly for a fourth input.

Figure 3-10. Circuit topology for a two-input OR/NOR
logic gate.


Despite the drastic change in functionality, the
presence of several logic inputs in parallel to the original
logic input induces no fundamental change to the DC bias of
the circuit. As a result, the DC bias conditions for the
optimized inverter circuit are directly applied to the final
design of the NOR circuit.

2. Final Design Summary: OR/NOR

With the exception of having multiple parallel 
transistors for multiple logic inputs, the final design for 
eee CML, OR/NOR leog@e circuit 1s identical to that of the 
inverter. As for its performance, the noise margins and 
delay measurements vary only slightly in response to the 
"multiple trigger" effect of simultaneous parallel inputs. 
The design parameters are identical to the inverter and 
therefore are not repeated. However, a selection of the 
performance parameters have been provided in Table (3-5) in 
order to demonstrate the variation of performance based upon 
Enem—EnpDut COonriguration. 


Conveniently, the NOR gate constitutes a near identical 
capacitive load as the inverter — with maximum delay 


differences of less than 1.5ps. It exhibits the same delay 
variations between its OR and NOR signals as the inverter 
does between the inverted and non-inverted signals. And 
fae) Teel.) va as with the inverter, when both of the 
complementary outputs of the OR/NOR gate are not required, 
the unused output buffer stage is not included to conserve 


power and minimize the device count. 
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CML OR/NOR Gate 
Delay Performance Parameters 


2-Input OR/NOR Gate 
Single Input Transition 





NOR Signal OR Signal 
Fanout = 1 Fanout = 4 Fanout = 1 £Fanout = 4 
Single to (H-1) 16ps 29ps 40ps 4’lps 
Input 
Transition © -#) 24ps 29ps oes 230s 
3-Input OR/NOR Gate 
Single and Simultaneous Input Transitions 
NOR Signal OR Signal 
Fanout = 1 Fanout = 4 Fanout = 1 #£Fanout = 4 
Bingie taeae. pied) 19ps 28ps Alps 48ps 
Transition C5 -x) 29ps 34ps 18ps 23ps 
Simultaneous “p(e-n) 1'/ps 3 6ps 40ps 47Tps 
Input. ©. (1-H) 43ps 48ps lips 16ps 
Transition 
4-Input OR/NOR Gate 
Single Input Transition 
NOR Signal OR Signal 
Fanout = 1 Fanout = 4 Fanout = 1 #£Fanout = 4 
Single Input %q@-n) 21ps 30ps 41ps 48ps 
Transition eye eg 33ps 39ps 18ps fa eyes 





Table 3-5. Summary of OR/NOR Gate Delay Performance. 
Implementation of the AND Function

In current-mode logic, the AND function is implemented
by simply inverting the input signals and reversing the
polarity designation of the output nodes. In actual
practice, inverters and OR/NOR gates are sufficient to
realize any logic function. Thus, for the sake of
simplicity, AND gates were not constructed as a separate
logic circuit. Rather, all logic functions were
deliberately expressed as functions of inverters and OR/NOR
gates.
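
The AND-from-OR/NOR relationship described above is De Morgan's
theorem applied to the gate outputs. The following is a minimal
behavioral sketch of that idea, not a circuit model: the function
names `or_nor`, `and_via_or_nor`, and `nand_via_or_nor` are
hypothetical and were chosen only for this illustration.

```python
from itertools import product


def or_nor(*inputs):
    """Behavioral model of an OR/NOR gate: returns (OR output, NOR output)."""
    or_out = any(inputs)
    return or_out, not or_out


def and_via_or_nor(x, y):
    """AND(x, y): complement the inputs, then read the NOR output node."""
    _, nor_out = or_nor(not x, not y)
    return nor_out


def nand_via_or_nor(x, y):
    """NAND(x, y): complement the inputs, then read the OR output node."""
    or_out, _ = or_nor(not x, not y)
    return or_out


# Exhaustive truth tables over both inputs.
and_table = {(x, y): and_via_or_nor(x, y)
             for x, y in product([False, True], repeat=2)}
nand_table = {(x, y): nand_via_or_nor(x, y)
              for x, y in product([False, True], repeat=2)}
```

Because each CML gate can supply both output polarities, the
complemented inputs assumed here cost no additional inverter stage.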


D. ADDER DESIGN

1. Implementation

Two-input and three-input adders are required to
construct the carry-save adders and carry-completion adders
of the multiplier (Chapter V). Equipped with a sufficient
set of logic gates, this is an elementary task. The
sum-of-minterms expressions for the sum and carry bits of a
two-input adder are shown in Equations (3-8) and (3-9),
respectively.

(3-8)   $\mathrm{Sum}|_{2input} = XY' + X'Y$

(3-9)   $\mathrm{Carry}|_{2input} = XY$





Employing De Morgan's Theorem, these expressions can be
manipulated into the equivalent expressions for
implementation with OR/NOR gates, as shown in Equations
(3-10) and (3-11).

(3-10)  $\mathrm{Sum}|_{2input} = (X'+Y)' + (X+Y')'$

(3-11)  $\mathrm{Carry}|_{2input} = (X'+Y')'$

This adder design requires that the complementary logic
inputs be provided in order to eliminate the need for
inverters and a third level of logic delay. Such a
requirement is trivial because complementary signals are
potentially available at the output of each CML logic gate.
Figure (3-11) illustrates the two-input adder.





Figure 3-11. Two-input adder with identification of the
critical path.
A similar procedure was followed to implement Equations
(3-12) and (3-13) for the construction of a 3-input adder,
illustrated in Figure (3-12).

(3-12)  $\mathrm{Sum}|_{3input} = (X'+Y+Z)' + (X+Y+Z')' + (X+Y'+Z)' + (X'+Y'+Z')'$

(3-13)  $\mathrm{Carry}|_{3input} = (X'+Y')' + (X'+Z')' + (Y'+Z')'$

Figure 3-12. Three-input adder with identification of
the critical path.
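
As a cross-check on the NOR-based adder expressions, the sketch
below evaluates them for every input combination and compares the
result against ordinary binary addition. It is a logic-level
illustration only. The helper names `nor_gate`, `or_gate`,
`two_input_adder`, and `three_input_adder` are hypothetical, and
the three-input expressions are assumed to follow the same pattern
as the two-input case: a first level of NOR gates fed with
complemented literals, followed by one OR gate.

```python
from itertools import product


def nor_gate(*inputs):
    """First logic level: NOR of the supplied literals (0 or 1)."""
    return int(not any(inputs))


def or_gate(*inputs):
    """Second logic level: OR of the first-level NOR outputs."""
    return int(any(inputs))


def two_input_adder(x, y):
    """Sum = (X'+Y)' + (X+Y')',  Carry = (X'+Y')'."""
    xn, yn = 1 - x, 1 - y          # complements are assumed available
    total = or_gate(nor_gate(xn, y), nor_gate(x, yn))
    carry = nor_gate(xn, yn)
    return total, carry


def three_input_adder(x, y, z):
    """Sum and carry of three bits in the same two-level NOR/OR form."""
    xn, yn, zn = 1 - x, 1 - y, 1 - z
    total = or_gate(nor_gate(xn, y, z), nor_gate(x, y, zn),
                    nor_gate(x, yn, z), nor_gate(xn, yn, zn))
    carry = or_gate(nor_gate(xn, yn), nor_gate(xn, zn), nor_gate(yn, zn))
    return total, carry


# Compare against integer addition for every input combination.
two_ok = all(s + 2 * c == x + y
             for x, y in product((0, 1), repeat=2)
             for s, c in [two_input_adder(x, y)])
three_ok = all(s + 2 * c == x + y + z
               for x, y, z in product((0, 1), repeat=3)
               for s, c in [three_input_adder(x, y, z)])
```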
2. Performance Analysis

Proper functioning of each adder was verified for all
possible input combinations. Notice that the critical path
for each adder is identified in Figures (3-11) and (3-12).
For the two-input adder, the critical path flows through two
levels of logic to produce the sum bit. The worst-case
transition is from a (1/0) or a (0/1) input for (X/Y) to a
(1/1) input. This is owing to the fact that the worst-case
gate delay is the high-to-low transition of the OR output
when it has been driven by the high-to-low output transition
of the preceding NOR gate. Based upon the data from Table
(3-5), the critical path delay equals 63 picoseconds. This
provides a good match with a simulation of the critical path
delay, which yields 60 picoseconds.

Similarly, for the three-input adder the critical path
delay is calculated to be 67 picoseconds along the path
illustrated in Figure (3-12). This was validated with a
simulation measurement of 66 picoseconds.


E. PRACTICAL CURRENT SOURCE DESIGN 

1. Circuit Topologies

Up to this point, each logic element has been designed
using an ideal current source. In order to validate the
performance of these designs for actual implementation, it
is necessary to construct a practical current source. There
are effectively three circuit configurations which provide
transistor bias conditions for establishing a current
source. These three topologies are presented in Figure
(3-13). In each configuration the amount of bias current
drawn is regulated by and directly proportional to the
magnitude of the current drawn by the base of Q_SOURCE.


[Figure 3-13 panels (a), (b), and (c): schematic labels include
V_CC, POS, I_bias, Q_SOURCE, and Q_MIRROR.]

Figure 3-13. Current Source Topologies.


2. Performance Analysis

In order to analyze and compare the performance of each
current source, three simple 0.75mA current sources are
designed, one using each topology. Each is then
implemented as the practical current source for the
inverter/buffer circuit of Chapter III-B-5. Their relative
performance is evaluated based upon the following design
goals:
• Minimize the operational limitations due to
  frequency response

• Approximate the performance of an ideal current
  source

• Minimize the cost of implementation (power and
  device count)

The performance of each configuration is illustrated in
Figure (3-14a) and (3-14b). Notice that each inverted
output signal drops below the desired 1.2 volt voltage low
level when making the transition from high-to-low. This
undershoot results from reversing the polarity of the
differential pair input signals, which induces a brief drop
in the bias voltage at the positive (POS) terminal of the
current source. A delayed return to the proper bias voltage
is then governed by the RC characteristics of the Q_SOURCE
collector. This delay is particularly observed in the
transient performance of the topologies in Figure (3-13a)
and (3-13b).

3. Final Design: Current Source 

By process of elimination, the current mirror topology
of Figure (3-13c) is the only design suitable for driving a
logic device family that is capable of switching frequencies
above 8 GHz. Unfortunately, the current mirror also incurs
the largest cost in terms of power and device count. Thus,
to reduce the amount of current "lost" through the left side
of the current mirror, Q_MIRROR is given a smaller area than
Q_SOURCE. Testing a variety of such configurations yields a
current mirror configuration that implements Q_MIRROR with a
smaller transistor and Q_SOURCE with a (1x3) micron
transistor.
Figure 3-14. Transient performance of three practical
current source topologies compared to an ideal source.

a) 0.75mA Current Source

The final design for a 0.75mA current source is shown
in Figure (3-15). The DC transfer characteristic of this
source, Figure (3-16), illustrates that the bias current
drawn is a function of the collector-emitter voltage (V_CE)
at the current source transistor. More specifically, it is
seen that V_CE must be greater than 0.3 volts in order to
ensure that 0.75mA is drawn. This represents a critical
design parameter for establishing a proper DC bias on the
current source.


[Figure 3-15 schematic labels: V_CC, R = 5250 Ω, POS,
I_bias = 0.75mA, Q_MIRROR, Q_SOURCE.]

Figure 3-15. Final Design of a Practical 0.75mA Current
Source.


The 0.75mA current source design is validated by a
direct performance comparison with an ideal current source.
Figure (3-17) compares the output signals for a maximally
loaded inverter/buffer circuit when driven by an ideal and
a practical current source. It can be seen that the
transition delay resulting from the practical source is
consistently ahead of the ideal source for the inverted
output signal by a margin of five picoseconds. Meanwhile,
the non-inverted output signal of the practical current
source matches the delay of the ideal source. In a design
that is characterized by alternating stages of positive and
negative logic signals, it is reasonable to expect that the
implementation of the practical current source would yield a
slight improvement over the ideal source.

Figure 3-16. Transfer Characteristic of the 0.75mA
Current Source.

Figure 3-17. Comparison of Inverter Performance,
Practical Current Source vs. an Ideal Source.

b) 2.0mA Current Source 

Exercising a little foresight into the conclusions
of Chapter IV, it is convenient here to present the design
of the 2mA practical current source. This design is a
simple modification to the 0.75mA design, implemented by
decreasing the resistance from 5250 Ω to [value illegible].
This allows an increase of current flow into the base of
Q_MIRROR and produces the transfer characteristic shown in
Figure (3-18). Again, a bias voltage at Q_MIRROR must
ensure that V_CE is greater than or equal to 0.3 volts in
order to achieve proper functioning of the current source.

Figure 3-18. Transfer Characteristic of the 2.0mA
Current Source.

The 2mA current source is also validated by
testing it against an ideal current source while driving a
maximally loaded D-type CML latch. The respective output
signals, Q and QN, are plotted in Figure (3-19). It can be
seen that the output signal transition delay resulting from
the practical source compares favorably with the delay
associated with the ideal source. However, the ideal-driven
output signals consistently cross the reference voltage of
1.45 volts approximately 10 picoseconds ahead of the
practical-source-driven output signals. Thus, the effective
margin of error for approximating the practical source with
an ideal source is 10 picoseconds. In a synchronous
pipelined architecture, this simply adds between 10 and 20
picoseconds to the minimum clock period.
Figure 3-19. Comparison of Latch Performance, Practical
Current Source vs. an Ideal Source.


In summary, a sufficient set of logic circuits is now
in hand, along with a practical current source with which to
drive them. Thus, the combinational logic for a multiplier
can be fully implemented. However, based upon the intent of
pipelining this multiplier, it is necessary to construct the
clock-driven devices that will control the flow of data.
Chapter IV presents this discussion with the design of a
D-type latch, a D-type flip-flop, and a clock driver.
IV. HBT CML LATCH AND REGISTER DESIGN


A. LATCH DESIGN


1. Circuit Topology


a) Two Latch Topologies 

The most common latch design is based upon the
logic level schematic illustrated in Figure (4-1). Design
of this latch simply requires the proper connection of four
NOR gates with the appropriate clock and logic input
signals. The cumulative power consumed by the four NOR
gates constitutes a significant cost (based upon the four
milliwatt per gate design from Chapter III).


[Figure 4-1 schematic: inputs D, DN, and CLOCK; outputs Q and QN.]

Figure 4-1. D-type Latch constructed from NOR gates.



However, the unique characteristics of CML provide an
alternative design that yields comparable performance at a
significant savings in power. This CML latch design is
illustrated in Figure (4-2). Due to the relative
unfamiliarity of this design, a brief functional description
follows.


[Figure 4-2 schematic labels: V_CC, R_OUT, Output Buffer Stage.]

Figure 4-2. CML D-type Latch Design (After Jalali).
b) Functional Description of a CML Latch

Referencing Figure (4-2), the source labeled I_bias
draws a constant current through the lower (clock-driven)
differential pair. Complementary clock signals provide the
differential inputs. Depending upon the phase of the clock
signal, current is drawn from one of the two cascaded
differential pairs, i.e. either the track pair or the latch
pair. Consider the case when the CLK signal is high.
Current will be drawn from the "track" pair while the
"latch" pair is simultaneously cut off. In this case the
latch is considered "open" or "transparent," and the track
pair behaves like the differential pair configuration of the
inverter/buffer logic gate. Thus, the logic inputs of the
track pair are mirrored at the opposite collector. However,
there is one exception. In the CML latch, complementary
logic inputs are employed rather than a logic reference
voltage. For a single logic input, complementary input
signals enhance noise immunity and provide for symmetric
waveforms at the complementary output ports.

Now, consider when the CLK signal transitions from
high to low. The track pair is cut off as current is
switched to the latch pair via the right side of the
clock-driven differential pair. Herein lies the
significance of the common collector nodes shared by the
track pair and latch pair. Due to the high impedance nature
of the HBT collector-base junction, the voltage level at the
collector is slow to change and lingers long enough to bias
the latch pair for essentially identical operation and
output levels. This effectively latches the logic levels
from the track pair to the latch pair. (Jalali, 1995)

Regardless of the state of the latch, the logic 
levels at the common collector (of the track and latch 
pairs) are reflected at the latch output ports via the same 
output buffer configuration presented in Chapter III. 
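
The track/latch behavior described above can be summarized with a
purely logical model. The sketch below ignores all analog effects
(collector capacitance, switching noise, buffer delay) and only
captures the rule that the output follows D while CLK is high and
holds its last value while CLK is low. The names `cml_latch` and
`simulate_latch`, and the sample waveforms, are hypothetical.

```python
def cml_latch(d, clk, q_prev):
    """Level-sensitive D latch.

    clk high: the track pair conducts, so Q follows D (transparent).
    clk low:  the latch pair conducts, so Q holds its previous value.
    Returns the complementary outputs (Q, QN) as 0/1 integers.
    """
    q = d if clk else q_prev
    return q, 1 - q


def simulate_latch(d_wave, clk_wave, q_initial=0):
    """Apply the latch rule sample by sample and return the Q trace."""
    q = q_initial
    trace = []
    for d, clk in zip(d_wave, clk_wave):
        q, _ = cml_latch(d, clk, q)
        trace.append(q)
    return trace


# Illustrative waveforms: two samples per clock phase.
clk_wave = [1, 1, 0, 0, 1, 1, 0, 0]
d_wave = [1, 0, 1, 1, 0, 1, 1, 0]
q_trace = simulate_latch(d_wave, clk_wave)
```

In this trace the output ignores the changes on D during each low
clock phase and retains the value present when the clock fell.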

2. Initial Conditions and Design Parameters

The CML latch presents the most demanding DC bias
requirements of any circuit designed for this project. As a
result, no voltage cap has been placed upon its design.
Rather, the initial design goal is to determine the minimum
necessary DC bias conditions for proper operation of the
latch. The resulting "voltage budget" will define the
voltage relationships for proper operation of each
transistor and differential pair. It will further establish
important specifications for supply voltage and logic signal
levels. Derivation of the "voltage budget" is presented as
part of the DC analysis in the following section.

The minimum available transistor area (1x1 micron) is
employed for optimum switching speeds, and the fanout
requirement remains at four. These specifications are
consistent with the logic circuits designed in the previous
chapter.


3. DC Analysis


a) DC Bias Conditions / The Voltage Budget 
For proper operation of the CML latch, each
differential pair of transistors must be properly biased.
Knowing the requirements imposed by proper DC bias
conditions will reveal the following necessary design
parameters:

• Required minimum supply voltage

• Required minimum voltage level for representing the
  positive (high) phase of the clock

• Required minimum voltage level for representing a
  logic high state

• Maximum allowable signal range between high and low
  logic levels

To facilitate analysis, the CML latch topology is divided
into three levels of operation, as illustrated in Figure
(4-3). Level one (the bottom level) is a practical current
source. Implementing the design from Chapter III-E, the
current source requires a minimum of V_Ibias volts at node X
in order to sustain the desired level of bias current.
This requirement imposes the following operational condition
upon the "driving" base voltage of the Q1/Q2 differential
pair (i.e. the high CLK voltage).

(4-2)   $V_{CLK(hi)} \geq V_X + V_{BE(on)}|_{Q1,2}$

A further consideration is the proper biasing of
the Q1/Q2 collectors for operation in the active region.
This places the following operational condition upon the
collector voltages (nodes Y1 and Y2).

(4-3)   $V_Y \geq V_{CLK(hi)} - V_{BE(on)}|_{Q1,2} + V_{CE(sat)}$

where V_Y represents either V_Y1 or V_Y2.

Only the tracking differential pair (connected to node Y1)
will be addressed at this point because it is driven by
lower voltage levels which impose more restrictive DC bias
conditions on Y1 than Y2.

Once again, a minimum voltage requirement at the
common emitter of the Q3/Q4 differential pair presents a
constraint on the minimum steady-state driving voltage at
each base. This driving voltage corresponds to a logic
high input voltage. Thus, the voltage level selected to
represent a logic high must satisfy the following
relationship.

(4-4)   $V_{LOGIC(hi)} \geq V_{BE(on)}|_{Q3,4} + V_{Y1}$


Finally, three conditions must be satisfied at the
collectors of the track pair. The first condition is that
transistors Q3 and Q4 must operate in the active mode. This
requires the following familiar relationship.

(4-5)   $V_{C(low)} \geq V_{LOGIC(hi)} - V_{BE(on)}|_{Q3,4} + V_{CE(sat)}$

where V_C represents either V_C1 or V_C2.

Similarly, the second condition requires that the
transistors of the latch pair also operate in the active
mode. This condition differs from the one above because the
latch pair is driven by the collector voltage levels of the
track pair.

(4-6)   $V_{C(low)} \geq V_{C(hi)} - V_{BE(on)}|_{Q5,6} + V_{CE(sat)}$


Defining the voltage range of the logic signal (V_RANGE) as
the difference between high and low voltage levels, Equation
(4-6) is manipulated to show the maximum value.

(4-7)   $[V_{RANGE}]_{max} = V_{BE(on)}|_{Q5,6} - V_{CE(sat)}$

Knowing the transistor parameters for V_BE(on) and V_CE(sat)
from Chapter II, [V_RANGE]max can be evaluated directly.


The third condition is that the input and output
logic levels must match. A high logic input (V_LOGIC(hi)) at
the transistor base must drive the collector voltage
relatively low (V_C(low)) such that it produces a matched
low logic output at QN. Likewise, the inverse must also be
true. The following equations express these requirements.

(4-8)   $V_{LOGIC(hi)} - V_{RANGE} = V_{C(low)} - V_{BE(on)}|_{buffer}$

(4-9)   $V_{LOGIC(low)} + V_{RANGE} = V_{C(hi)} - V_{BE(on)}|_{buffer}$

Based upon these relationships the maximum collector voltage
is determined, which further dictates the minimum required
supply voltage for proper DC operating conditions.
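
The chain of bias conditions can be followed numerically from the
bottom of the latch upward. The sketch below is one hedged reading
of that chain: each condition is taken at equality and a design
margin is then added where the text allows one. The function name
`latch_voltage_budget`, its argument names, and the choice to round
to millivolts are assumptions made for this illustration. The
sample inputs (V_BE(on) = 0.80 V, V_CE(sat) = 0.30 V,
V_Ibias = 0.3 V, with margins of 0.1 V and 0.2 V) are the values
read from the 1.5mA column of Table (4-1).

```python
def latch_voltage_budget(v_be_on, v_ce_sat, v_ibias,
                         clock_margin, logic_margin):
    """Walk the DC bias conditions from the current source upward.

    All arguments and results are in volts. Each inequality in the
    voltage budget is evaluated at equality; the two margins are
    added explicitly at node X and at the logic-high level.
    """
    v_x = v_ibias + clock_margin               # node X above the source
    v_clk_hi = v_x + v_be_on                   # clock-high condition
    v_y1 = v_clk_hi - v_be_on + v_ce_sat       # Q1/Q2 collectors active
    v_logic_hi_min = v_be_on + v_y1            # minimum logic-high level
    v_logic_hi = v_logic_hi_min + logic_margin
    v_range_max = v_be_on - v_ce_sat           # maximum logic swing
    budget = {
        "v_x": v_x,
        "v_clk_hi": v_clk_hi,
        "v_y1": v_y1,
        "v_logic_hi_min": v_logic_hi_min,
        "v_logic_hi": v_logic_hi,
        "v_range_max": v_range_max,
    }
    # Round to millivolts to hide floating-point residue.
    return {name: round(value, 3) for name, value in budget.items()}


budget_1p5ma = latch_voltage_budget(
    v_be_on=0.80, v_ce_sat=0.30, v_ibias=0.3,
    clock_margin=0.1, logic_margin=0.2)
```

With these inputs the sketch gives a clock-high level of 1.2 volts
and a logic-high level of 1.7 volts, each 0.5 volts above the
corresponding low level used in the budget.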

The voltage budget relationships are summarized in
Figure (4-3). Actual values have been determined for four
latch configurations as listed in Table (4-1). The
essential difference is the magnitude of the bias current.
An economical margin of safety has been built into these
values.

Notice that these margins have been allowed to
vary slightly between configurations in order to maintain
uniform values for clock and logic signal values. This
greatly simplifies the comparative testing of the four
configurations. The design margins are highlighted to
illustrate the negligible deviation incurred. All four
configurations meet and exceed the required DC bias
conditions. In the event that uniform design margins had
been used such that the supply voltages were optimized, the
difference would have been trivial: within plus or minus
0.1 volt, or 4% of the 2.5 volt supply voltage.


b) DC Bias Optimization 

At this point the gain resistance, buffer
resistance, and the bias current are the only undetermined
parameters. The same procedures described in the design of
the inverter/buffer circuit are employed to design four
different latch configurations based upon the specifications
determined in Table (4-1).


CML Latch Voltage Budget
for Multiple Bias Current Configurations
(all values in volts; [--] marks an entry that is not legible)

                                          1mA     1.5mA    2mA     3mA
Known/Measured Parameters:
  V_BE(on)                               0.775    0.80    0.82    0.857
  V_CE(sat)                              0.26     0.30    0.31    0.35
  V_Ibias                                0.3      0.3     0.3     0.3

Determined Parameters:
  [V_RANGE]max                           0.515    0.5     [--]    0.507
  Margin for Range of
    Logic Signal Voltage                 0.015    0.0     [--]    0.007
  [V_RANGE]actual                        0.5      0.5     [--]    0.5

  V_CC                                   2.5      2.5     [--]    [--]
  Margin to nearest tenth of a volt      0.075    0.025   [--]    0.0
  V_C(hi)                                2.425    2.475   [--]    [--]

  [V_LOGIC(hi)]actual                    1.7      1.7     [--]    1.7
  Margin for Differential
    Logic Signal Switching               0.24     0.2     [--]    0.15
  [V_LOGIC(hi)]min                       1.46     1.5     [--]    1.55
  V_Y1                                   0.685    0.7     [--]    0.693

  V_CLK(hi)                              1.2      1.2     [--]    1.2
  V_X                                    0.42     0.4     [--]    0.358
  Margin for Differential
    Clock Signal Switching               0.12     0.1     [--]    0.058
  V_Ibias                                0.3      0.3     [--]    0.3

Based upon a 0.5 volt signal swing for both logic and clock signals:
  V_LOGIC(low)                           1.2      1.2     1.2     1.2
  V_CLK(low)                             0.7      0.7     0.7     0.7

Table 4-1. Voltage Budget for the CML D-type Latch.

Noise margins are obtained from the DC transfer
characteristic of each. These results are included in Table
(4-2). With maximum fanout loads on both output ports, all
four CML latch designs meet the requirements of a 0.5 volt
output signal range and 0.1 volt (20%) balanced noise
margins. Therefore, all four CML designs are considered in
transient analysis.


Bias      Gain       Buffer     High Noise Margin   Low Noise Margin   Logic Signal
Current   Resistor   Resistor   NoLoad/Loaded       NoLoad/Loaded      Range
(mA)      (Ohms)     (Ohms)     (Volts)             (Volts)            (Volts)

1         600        2000       0.14 / 0.13         0.13 / 0.13        0.49
1.5       410        2000       0.13 / 0.13         0.13 / 0.13        0.51
2         310        2000       0.12 / 0.12         0.12 / 0.12        0.51
3         210        2000       0.11 / [illegible]  0.11 / 0.11        0.52


Table 4-2. Results of DC Analysis.


4. AC/Transient Analysis 


a) Performance Parameters 

Three parameters are of primary interest in
evaluating the transient performance of a latch: setup
time, hold time, and logic propagation delay. Figure (4-4)
illustrates how each of these relates to the events on a
transient plot. In the absence of a reference voltage,
differential signal references are taken as the point where
the complementary signals cross.

[Figure 4-4 annotations: CLOCK (Open, Latched), SETUP Time,
HOLD Time, Propagation Delay (Low-to-High).]

Figure 4-4. Illustration of setup time, hold time, and
propagation delay.

As a figure of merit for optimizing the trade-off 
between speed and power, a power-delay product is calculated 
using the values defined here. The figure for power 
represents the average power, and the figure for delay 
represents the sum of the setup time and the worst-case 


propagation delay time. 
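
The figure of merit defined above is a simple product, so it can be
stated compactly. The sketch below is illustrative only: the
function names `power_delay_product_fj` and `normalize` are
hypothetical, and the sample inputs (9.0 mW, 33 ps setup, 27 ps
worst-case propagation delay) are the values reported in Table
(4-4) for the 2mA latch at a fanout of one. Milliwatts multiplied
by picoseconds gives femtojoules.

```python
def power_delay_product_fj(power_mw, setup_ps, worst_prop_ps):
    """Power-delay product in femtojoules.

    Delay is taken as setup time plus worst-case propagation delay;
    mW * ps = 1e-3 W * 1e-12 s = 1e-15 J, i.e. femtojoules.
    """
    delay_ps = setup_ps + worst_prop_ps
    return power_mw * delay_ps


def normalize(values):
    """Scale a list of products so that the smallest one equals 1.0."""
    smallest = min(values)
    return [value / smallest for value in values]


# 2mA latch, fanout of one, using the Table (4-4) figures.
pdp_2ma_fj = power_delay_product_fj(power_mw=9.0, setup_ps=33,
                                    worst_prop_ps=27)
```

Normalizing a set of such products against the smallest one is one
plausible way to compare configurations on a common scale.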


b) Analysis Procedures 
For an accurate evaluation of latch performance,
it is necessary to provide realistic logic and clock input
signals as well as realistic worst-case fanout loads.
Furthermore, to ensure and demonstrate the proper DC bias
design of the CML latch, practical current sources are
implemented in testing.

In addition to the four CML latch designs, the
traditional logic latch is also tested. Each design is
substituted into the test circuit to determine the
performance parameters described in the previous section.


c) Summary of Results 

The results of transient analysis are summarized
in Table (4-3). The 1.5mA configuration achieves the
minimum power-delay product as illustrated in Figure (4-5).
Note, however, that the 2mA configuration performs at a
comparable level of efficiency. In the interest of
maximizing speed, it is a reasonable design trade-off to
sacrifice two percent efficiency in order to acquire a 12
percent reduction in latch delay. Thus, the 2mA CML latch
configuration is selected for the implementation of a D-type
flip-flop.

[Figure 4-5 axes: Normalized Power-Delay Product versus Bias
Current (mA) at 1, 1.5, 2, and 3 mA, with the NOR latch marked
for comparison.]

Figure 4-5. Results of Transient Analysis:
Normalized Power-Delay Product of Latch Configurations.

Regardless of the configuration, switching noise
proves to be a prominent characteristic of transient
performance in the CML latch. Figure (4-6) illustrates the
effect of switching noise on the latch output, Q. The noise
indicates a capacitive spike at the mutual collector nodes
of the latch and track differential pairs. This results
each time the clock-driven pair switches current to the
opposite side. It is not expected that this noise will
adversely affect the ability of the CML latch to drive
reliable logic levels. However, in the event that the CML
latch is overcome by noise, the NOR latch configuration is a
viable alternative because it does not experience this
problem.

Figure 4-6. Switching Noise in the CML Latch Output.

Finally, the switching activity of the
differential pair also induces variations in the current
drawn from the supply voltage. Figure (4-7) illustrates
these power rail transients for a single CML latch. The
abrupt, periodic reduction in the supply current coincides
with the transfer of current from one side of the
differential pair to the other, driven by the switching of
the clock signal. In the worst case, this downward
transient spike reaches a current level that is 18% below
the average. It is also evident that slightly more current
is drawn when the latch is latched because the latch pair is
driven by a higher input voltage than the track pair. This
results in a higher voltage and thus more current being
drawn at the practical current source.

[Figure 4-7 plot: supply current versus Time (ns), annotated
with LATCHING and OPENING events and "Latch Pair = More Current".]

Figure 4-7. Power Rail transients due to the switching
activity of a single CML Latch.

5. Special Latch Implementations

In the course of this design project, two special
implementations of the CML latch have been designed. The
first implements a logic reference voltage at one of the
logic inputs of the latch. The purpose here is to eliminate
the requirement for complementary logic signals at the
input.

The second special implementation also uses a reference
voltage; however, it does so with the purpose of conducting
a logic function at the input to the latch. Although this
circuit functions well, it actually results in slightly
greater delays due to the increased collector capacitance at
the tracking pair. As a result, it is not utilized in the
multiplier circuit.
6. Final Design Summary: D-Latch

The final design for the CML latch is implemented with
the parameters listed in Table (4-4) using the topology
presented previously in Figure (4-2). Also listed are the
transient performance parameters for operation at each level
of fanout loading. These figures represent the performance
of the latch when it is implemented with a practical current
source and driven by a maximally loaded clock driver.


Latch
Design and Performance Summary

R_gain:   310 Ω
R_out:    2000 Ω
I_bias:   2 mA
NM_L:     0.12 V
NM_H:     0.12 V
Power:    9.0 mW

Max Fanout   Setup   Hold    t_prop   t_prop   Total
Load         Time    Time    H-L      L-H      Delay
(# gates)    (ps)    (ps)    (ps)     (ps)     (ps)
    1         33       9      27        0       60
    2         33      10      28        1       61
    3         34      10      31        2       65
    4         35      10      34        3       69

Table 4-4. Final Design Summary of the D-type
CML Latch.
B. FLIP-FLOP DESIGN (D-TYPE)

1. Overview and Analysis

The D-type flip-flop is constructed from two D-type CML
latches. The two latches are connected in a master-slave
configuration such that they are latched by opposite phases
of the clock. This simple design is illustrated in Figure
(4-7).





[Figure 4-7 block diagram: two cascaded D-LATCH blocks between
the D/DN inputs and the Q/QN outputs, one clocked by CLOCK and
the other by INVERTED CLOCK.]

Figure 4-7. D-type Flip-Flop.


The flip-flop design is tested under the same 
conditions of loading and input signals as discussed 
previously for the latch. This testing verifies proper 
function of the flip-flop design and confirms that the flip- 
flop performance parameters of setup time and hold time 
mirror those of the CML latch. However, due to the presence 
of a second latch in the flip-flop, the propagation delays 


are greater. 
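
The master-slave arrangement can be illustrated with a logic-level
sketch built from two level-sensitive latches. This is not a timing
model. The class name `MasterSlaveFlipFlop`, the helper `d_latch`,
and the sample waveforms are hypothetical, and the clock polarity is
an assumption: the master is taken to be transparent while the clock
is low and the slave while the clock is high, which yields a
rising-edge-triggered flip-flop.

```python
def d_latch(d, enable, q_prev):
    """Level-sensitive latch: follow d when enabled, otherwise hold."""
    return d if enable else q_prev


class MasterSlaveFlipFlop:
    """Two latches clocked on opposite phases (polarity assumed)."""

    def __init__(self):
        self.master = 0
        self.slave = 0

    def step(self, d, clk):
        # Master is transparent on the low phase, slave on the high phase.
        self.master = d_latch(d, not clk, self.master)
        self.slave = d_latch(self.master, clk, self.slave)
        return self.slave


# Illustrative waveforms: two samples per clock phase.
clk_wave = [0, 0, 1, 1, 0, 0, 1, 1]
d_wave = [0, 1, 0, 0, 1, 0, 1, 1]
flip_flop = MasterSlaveFlipFlop()
q_trace = [flip_flop.step(d, clk) for d, clk in zip(d_wave, clk_wave)]
```

The output changes only when the clock rises, and it takes the
value that D held just before that edge. Because the data must pass
through both latches, the extra propagation delay noted above is
expected.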
2. Final Design Summary

The final design for the CML D-type flip-flop is
essentially the master-slave configuration of two CML
latches, as illustrated in Figure (4-7). The design
parameters of the master and slave latches remain the same
as shown in Table (4-4). The applicable performance
parameters of the flip-flop have been summarized in Table
(4-5).


Flip-Flop
Design and Performance Summary

Reference Latch Design Parameters
Power:    18 mW

Max Fanout   Setup   Hold    t_prop   t_prop   Total
Load         Time    Time    H-L      L-H      Delay
(# gates)    (ps)    (ps)    (ps)     (ps)     (ps)
    1         33       9      49       25       82
    2         33       9      53       47       86
    3         34       9      52       45       86
    4         35      10      54       43       89

Table 4-5. Design and Performance Summary of the
D-type Flip-Flop.
C. CLOCK DRIVER DESIGN

1. Overview

The topology of the clock driver closely resembles that
of the inverter/buffer circuit. In fact, the only necessary
modification to the inverter/buffer design is a reduction of
the output voltage range at the output buffer. This is
accomplished by a simple voltage divider that effectively
steps the voltage down to the desired voltage range between
0.7 and 1.2 volts (Figure 4-8). This voltage range is
dictated by the CML latch design.

Two performance parameters are of particular interest
in the clock driver design: fanout capability and the
symmetry of complementary output signals. Increased fanout
is desirable to reduce the number of clock drivers required.
Meanwhile, output symmetry is important to reduce clock skew
between parallel clock paths. The absence of symmetry
between the complementary output signals of the logic
circuits (in Chapter III) results from the corresponding
lack of symmetry between the input signals, i.e. the use of
a reference voltage. Therefore, the clock driver is driven
by the differential clock signals CLK and CLK-N.

[Figure 4-8 schematic labels: HBT 1x1 transistors, CLOCK and
CLOCK COMPLEMENT inputs, CLOCK Output (Inverted), CLOCK Output
(Non-Inverted).]

Figure 4-8. Topology of the Clock Driver Circuit.

2. Analysis and Results

Fanout capability is maximized by the increase of
current through the output buffer. Two further
modifications to the inverter/buffer circuit make this
possible. The first is to increase the bias current. For a
supply voltage of 2.5 volts, a practical current source of
2mA is the largest that is operable without adversely
biasing the circuit. Second, reducing the total resistance
in the output buffer draws a larger base current and,
ultimately, more current is available to the output load.

For evaluation, the performance of two clock
driver configurations is measured based upon the power
consumed per load driven. The 1mA clock driver draws 5.5mA
and consumes 13.8mW while driving a maximum of two latches.
Meanwhile, the 2mA clock driver draws 6.5mA and consumes
16.3mW while driving four latches. Clearly, the 2mA clock
driver is the desired implementation.
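
The comparison rests on two short calculations: supply power as
current times supply voltage, and that power divided by the number
of latches driven. The sketch below reproduces them from the
figures quoted in the text (5.5mA for two latches, 6.5mA for four
latches, 2.5 volt supply). The function names are hypothetical.

```python
SUPPLY_VOLTS = 2.5  # supply voltage quoted for the clock driver


def driver_power_mw(current_ma, supply_volts=SUPPLY_VOLTS):
    """Supply power in milliwatts: mA times volts."""
    return current_ma * supply_volts


def power_per_latch_mw(current_ma, latches_driven):
    """Power cost of the driver per latch that it can drive."""
    return driver_power_mw(current_ma) / latches_driven


power_1ma_driver = driver_power_mw(5.5)          # 13.75 mW
power_2ma_driver = driver_power_mw(6.5)          # 16.25 mW
per_latch_1ma = power_per_latch_mw(5.5, 2)       # 6.875 mW per latch
per_latch_2ma = power_per_latch_mw(6.5, 4)       # 4.0625 mW per latch
```

On this basis the 2mA driver costs roughly 4.1mW per latch against
roughly 6.9mW per latch for the 1mA driver, which is consistent
with the selection made above.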

The switching behavior of the clock driver
coupled with its high current consumption warrants an
investigation of its power rail transient characteristic
(Figure 4-9). It is not surprising that it follows the same
periodic trend as discussed in the case of the CML latch.
In the worst case, the downward transient current spike
deviates by 14.6% from the average current level. Also of
interest is the noise induced on the clocking signal by
strong, simultaneous logic transitions at the latch input.
As a result, a clock driver must be capable of driving a
maximum fanout load of latches when every latch input
transitions simultaneously in the same direction.

[Figure 4-9 plot: supply current versus Time (ns), annotated
with Input Signal, Latched, OPENING, and LATCHING.]

Figure 4-9. Power Rail transients induced by the
switching activity of a single Clock Driver.

3. Final Design Summary: Clock Driver 

The final design for the clock driver is implemented 
with the parameters listed in Table (4-6) using the topology 


presented previously in Figure (4-8). 


            Clock Driver
   Design and Performance Summary

   Rgain:     400 Ω
   R1buf:     110 Ω
   R2buf:     450 Ω
   Ibias:     2 mA
   NML:       0.08 V
   NMH:       0.10 V

   Power:     16.3 mW
   Fanout:    4 Latches


Table 4-6. Design and Performance Summary of the
Clock Driver Circuit.


At this point, the set of building blocks is complete. 
The logic circuits of Chapter III and the clock-driven 
devices of Chapter IV are brought together in Chapter V to 


implement several pipelined multiplier configurations. 
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V. HBT CML PIPELINED MULTIPLIER DESIGN 


A. LOGIC STAGE DESIGN 
1. Overview

As introduced in Chapter II-C, the multiplier logic for
this project is implemented with the three functional
processes illustrated in Figure (5-1): partial product
generation, carry-save addition, and carry-completion
addition. In the case of the 8x8 bit multiplier which is
implemented in this chapter, the process of carry-save
addition is actually accomplished with successive stages of
carry-save adders. More specifically, the use of three-to-two
carry-save adders produces the logic implementation
illustrated in Figure (5-2). The detailed process of
carry-save addition is addressed in the following section;
however, this block diagram accurately represents the
functional design of the multiplier and establishes a
graphic reference for the follow-on discussion.

[Figure 5-1 block diagram: Multiplier and Multiplicand
inputs; Generation of Partial Product Terms; Carry-Save
Addition; Product output.]

Figure 5-1. Generalized Block Diagram of an 8x8 bit
Multiplier.

2. Carry-Save Adders

Each three-to-two carry-save adder takes three operands
and produces two outputs, a sum and a carry. However, the
carry-save adder implementations are not identical, because
the input configuration of the first carry-save adder stage
differs slightly from that of the follow-on stages.

Referencing Figure (5-3), the first carry-save adder
receives three non-aligned n-bit partial products. As a
result, it generates n+2 sum bits and n carry bits.
Meanwhile, the follow-on stages each receive an aligned
input pair comprised of the carry and sum terms generated by
the preceding stage. The third input is the next partial
product term, and it is shifted by one bit. Thus, the sum
is only n+1 bits wide and the carry is still n bits.

In the case of either carry-save adder, only the most
significant n bits of the sum term are passed on to the next
adder stage. The remaining least significant bit(s)
represent the next most significant bit(s) of the final
product and are passed directly to the multiplier output.
These bits are highlighted with a circle in Figure (5-3).
The final designs of the two carry-save-adder configurations
are provided in Figures (5-4) and (5-5). Note the presence
of more than simple adder circuits. A fanout limitation of
four prevents a single signal from driving the eight input
requirements for the current multiplier bit at each
carry-save-adder stage. Thus, the arriving multiplier bits
pass through an inverting buffer stage.

Figure 5-2. Logic Implementation of an 8x8 bit Multiplier
using six stages of Carry-Save-Adders and a Carry-Completion
Adder.

[Figure 5-3 illustration: Carry-Save Adder #1 and the
follow-on carry-save adder, showing carry outputs C[7:0],
sum outputs S[7:0], and the partial product inputs.]

Figure 5-3. Functional Illustration of the two Carry-
Save-Adder Implementations.

Figure 5-4. Logic Schematic of Carry-Save-Adder #1.

Figure 5-5. Logic Schematic of Carry-Save-Adder #2.

Furthermore, the OR/NOR gates are used to generate the 
partial product terms within each carry-save-adder stage, 
rather than at the multiplier input. Taking advantage of 
the complementary output signals available from the 
preceding register, the NOR gates perform a logical AND of 
each multiplicand bit with the appropriate multiplier bit. 
Local generation of the partial product terms avoids the
extensive requirement for intermediate registers that would 
be necessary to pass all partial product terms from one 
pipeline stage to the next (that is, referencing a scenario 
where all partial products are generated before the first 
carry-save adder). 
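The arithmetic described in this section can be modeled at the integer level. The following Python sketch is a behavioral model only, not the thesis circuit: it assumes unsigned operands, ignores the bit-width trimming and pass-through of low-order product bits, and uses illustrative function names. It forms each partial product bit as a NOR of complemented inputs (an AND by De Morgan's law), reduces the eight partial products with one initial and five follow-on three-to-two carry-save additions, and completes the carry with an ordinary addition.

```python
def nor(a, b):
    """1-bit NOR of two 0/1 values."""
    return 1 - (a | b)


def partial_product(multiplicand, multiplier_bit, width=8):
    """AND every multiplicand bit with one multiplier bit.

    Each AND is formed as NOR(not a, not b), mirroring the use of the
    complementary register outputs described in the text.
    """
    b_n = 1 - multiplier_bit
    pp = 0
    for i in range(width):
        a_n = 1 - ((multiplicand >> i) & 1)
        pp |= nor(a_n, b_n) << i
    return pp


def carry_save_add(x, y, z):
    """Three-to-two reduction: x + y + z == s + c."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def multiply_8x8(a, b):
    """Multiply two unsigned 8-bit values with six carry-save stages."""
    pps = [partial_product(a, (b >> j) & 1) << j for j in range(8)]
    s, c = carry_save_add(pps[0], pps[1], pps[2])  # first stage
    for pp in pps[3:]:                             # five follow-on stages
        s, c = carry_save_add(s, c, pp)
    return s + c                                   # carry completion


print(multiply_8x8(255, 255) == 255 * 255)
```

Because each carry-save stage preserves the running total in the pair (s, c), the final addition is the only place where carries must propagate across the full word.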

3. Carry-Completion Adders

The carry-completion adder implements ripple-carry 
addition. This elementary design is preferred over carry- 
look-ahead addition because it facilitates a variety of 
simple pipeline implementations. Figure (5-6) illustrates
the full carry-completion adder which can be conveniently
segmented into as many as eight pipeline stages by
separating the successive two and three-input adders.
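A bit-level sketch of ripple-carry addition shows why the adder segments so naturally. The code below is illustrative Python, not the thesis schematic; it assumes that the "two-input" adder is a half adder at the least significant bit and that the "three-input" adders are full adders for the remaining bits.

```python
def half_adder(a, b):
    """Two-input adder: returns (sum, carry)."""
    return a ^ b, a & b


def full_adder(a, b, cin):
    """Three-input adder: returns (sum, carry out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout


def ripple_carry_add(x, y, width=8):
    """Add two unsigned width-bit values one bit position at a time."""
    s, carry = half_adder(x & 1, y & 1)
    total = s
    for i in range(1, width):
        # Each iteration depends only on the carry from the previous bit,
        # so a register could be placed between any two iterations.
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total | (carry << width)


print(ripple_carry_add(200, 100))
```

Each of the eight bit positions hands only a single carry to its neighbor, which is what allows the chain to be cut at any bit boundary.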


Figure 5-6. An 8-bit Ripple-Carry Adder to perform
Carry-Completion.

B. REGISTER STAGE DESIGN
Regardless of the number of pipeline stages, each 


multiplier implementation requires two eight-bit input 


registers and a sixteen-bit output register. For pipeline
implementations with more than one stage, intermediate
registers are also required. The size of these registers
varies depending upon where the register is inserted in the
flow of logic. All intermediate and output registers
require complementary input signals. However, the input 
registers are distinctly designed to accept a single logic 
input signal for each bit, vice requiring complementary 
logic input signals. In order to accomplish this, the D- 
type flip-flops utilized in the input register must employ a 
special latch implementation which does not require 
differential input signals for the master latch of the 
master-slave flip-flop pair. The details of this latch 


implementation are presented in Chapter IV-A-5. 


C. CLOCK DISTRIBUTION

The purpose of the clock distribution scheme is to 
provide a local clock signal for clock-driven devices, 
namely the latches that comprise the registers described in 
the previous section. However, each clock driver can only
sustain a maximum load of four latches, i.e., two
flip-flops. Therefore, due to the number of clock-driven
devices and the limited fanout capability of the clock
drivers, the clock signal must propagate through an
extensive, multi-level distribution tree. As the number of
clock-driven devices increases, the number of levels in this
distribution tree must eventually increase as well. Thus,
the more heavily pipelined multiplier implementations must
make a larger investment of devices and power in clock
distribution.
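The growth of the distribution tree can be illustrated with a small Python sketch. It rests on an assumption that is not stated in the text: that a clock driver can also drive up to four other clock drivers, so the same fanout limit applies at every level. The function name and the ideal packing of loads are likewise illustrative.

```python
import math


def clock_tree(latches, fanout=4):
    """Return (levels, drivers) for an ideally packed clock tree.

    Assumes every driver, at every level, drives at most `fanout` loads
    (latches at the final level, other clock drivers above it).
    """
    if latches < 1:
        raise ValueError("at least one latch is required")
    levels, drivers = 0, 0
    loads = latches
    while True:
        stage = math.ceil(loads / fanout)  # drivers needed at this level
        drivers += stage
        levels += 1
        if stage == 1:                     # a single root driver remains
            return levels, drivers
        loads = stage


# Two 8-bit input registers plus a 16-bit output register:
# 32 flip-flops, i.e. 64 latches.
print(clock_tree(64))
```

Under this idealization, the 64 latches of the minimum register set already require three levels of clock drivers, and each added pipeline register pushes the count higher.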


D. MULTIPLIER IMPLEMENTATIONS 

Five pipelined multiplier implementations have been 
designed for testing via Tanner SPICE simulation tools.
These implementations include a one-stage pipeline, a
two-stage pipeline, a four-stage pipeline, a six-stage
pipeline, and a ten-stage pipeline. The arithmetic logic is identical
for each; however, the increased number of registers present 
in the more heavily pipelined implementations also implies a 
more extensive clock distribution tree. A block diagram of 


each implementation is presented in the following section. 


E. PERFORMANCE EVALUATION 

1. Evaluation Procedures

Prior to evaluation of the individual multiplier
implementations, the multiplier logic is successfully tested 
with several operands in order to verify that it produces an 
accurate product. Following this verification, it is the 


goal of this performance evaluation to identify the maximum 
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operating clock frequency for each pipeline implementation. 
However, this can only be done once the critical path, i.e.,
the critical pipeline stage, is determined for each 


multiplier. 


a) Critical Path Identification 

The most direct and absolute means of identifying 
the critical path is to conduct full-length simulations of 
each multiplier for every possible combination and sequence 
of two 8-bit input operands. Conducting these nearly 4.3 
billion simulations on each of the five multiplier designs 
is obviously prohibitive. Thus, the opposite extreme 
suggests that the worst-case transition delay be assumed for 
every logic circuit in every stage of the pipeline. While 
this successfully identifies an upper bound on the delay 
associated with the critical path, it is likely that the 
upper bound case does not exist as a result of two input 
operands. Furthermore, without knowledge of the input 
operands, simulations cannot be conducted for verification.
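The figure of nearly 4.3 billion simulations can be reproduced with a short calculation. The Python sketch below assumes one interpretation of "combination and sequence": every pair of 8-bit operands followed by every other pair, so that each simulation covers one transition between two consecutive input states. The variable names are illustrative.

```python
OPERAND_BITS = 8

# Every (multiplier, multiplicand) pair of 8-bit operands.
operand_pairs = 2 ** (2 * OPERAND_BITS)

# Every ordered sequence of two consecutive operand pairs.
transitions = operand_pairs ** 2

print(f"{operand_pairs:,} operand pairs, {transitions:,} transitions")
```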

Unfortunately, the logic behavior of the carry- 
save-adders makes an intuitive approach extremely difficult. 
Thus, a computer program designed by Kirk Shawhan, a
research associate, has been utilized to identify the
worst-case input combinations (Shawhan, 2000). The program
effectively identifies a unique upper bound delay for each 


set of input operands. Those input combinations with the 
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worst-case upper-bound delays are then simulated to identify 
a single worst-case pair of operands and the critical stage
where the most-delayed transition occurs. While it is not 
proven that this approach will identify the absolute 
critical path, it provides a reasonable and timely estimate 


for the purposes of this research. 


b) Maximum Throughput/Clocking Frequency 

Having determined the critical path, it is simply 
a matter of simulation time to identify the maximum clock 
frequency. For each pipeline implementation, a simulation 
is conducted which brackets the breakpoint of the
multiplier. Furthermore, examination of the margin by which
the setup time is met or missed provides a determination of 
the minimum clock period that is accurate within five 
picoseconds. 
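The idea of bracketing the breakpoint to a five-picosecond resolution can be sketched as a bisection search. This is a swapped-in illustration, not the procedure used in the thesis, which refines the estimate from the setup-time margin observed in simulation. The `passes` callback stands in for a full circuit simulation and is hypothetical, as is the 752 ps threshold in the usage lines.

```python
def min_period_ps(passes, fail_ps, pass_ps, resolution_ps=5.0):
    """Narrow a [failing, passing] clock-period bracket by bisection.

    `passes(period_ps)` must return True when the circuit latches the
    correct result at that clock period. Returns the smallest period
    known to pass once the bracket is no wider than `resolution_ps`.
    """
    if passes(fail_ps) or not passes(pass_ps):
        raise ValueError("bracket must run from a failing to a passing period")
    while pass_ps - fail_ps > resolution_ps:
        mid = (fail_ps + pass_ps) / 2.0
        if passes(mid):
            pass_ps = mid
        else:
            fail_ps = mid
    return pass_ps


# Hypothetical stand-in for a simulation: the circuit works at 752 ps or more.
toy_simulation = lambda period_ps: period_ps >= 752.0
estimate = min_period_ps(toy_simulation, fail_ps=600.0, pass_ps=900.0)
print(f"minimum period is about {estimate:.1f} ps")
```

Each call to the callback corresponds to one simulation run, so the number of runs grows only with the logarithm of the initial bracket width.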

The increased number of devices in the more 
heavily pipelined designs made full-circuit simulation times 
extremely long. As a result, the breakpoints for the four- 
stage, the six-stage, and the ten-stage multipliers were 
determined from partial simulations. Only the critical
stage and those stages immediately before and after it were
simulated.

2. Performance Results of Each Implementation
The following ten pages provide a two-page design and 


performance summary for each of the five pipelined 
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multiplier implementations. Figure (5-7) illustrates the 
design and critical path of the one-stage multiplier on a 
block diagram. Table (5-1) provides a summary of data which 
quantifies circuit complexity, power consumption, data 
throughput rate and data latency of the one-stage pipelined 
multiplier. Finally, Figure (5-8) illustrates the success
and failure of P14, the critical path, at clock frequencies
below and above the breakpoint of the circuit.

Similarly, Figures (5-9) through (5-16) and Tables 
(5-2) through (5-5) provide the same performance results for 
the two, four, six, and ten-stage pipelined multipliers, 
respectively. A comparative analysis is conducted as a
performance summary in the following section. 

As a final note, all full multiplier simulations are 
conducted using ideal current sources. This decision saves 
numerous simulation hours without sacrificing valid
transient performance data. A close correspondence has been
demonstrated between the transient performance of the
practical and ideal current sources for both the logic and
the latch designs. Use of the ideal source, however, does
produce overly optimistic power-consumption data due to the
absence of power dissipation from the transistors in the
practical current source. Therefore, the simulation data
for current consumption is scaled to accurately represent
the power consumed in practical implementation.
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P =1100 0000 0000 0001 


Figure 5-7. One-stage pipelined multiplier
implementation with an illustration of the
critical path.

            Number of     Number of    Current     Power
            Transistors   Resistors    (Amperes)   (Watts)
Logic       3952          2352         1.28        3.20
Registers   384           320          0.31        0.77
Clock       126           105          0.19        0.48
TOTAL       4462          2777         1.78        4.44

Maximum Throughput: 1.33 GHz
Latency: 0.75 nanoseconds


Table 5-1. Performance summary for the one-stage
pipelined multiplier.
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Figure 5-8. Performance bracket of the minimum period for 


the one-stage pipeline multiplier. 
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Figure 5-9. Two-stage pipelined multiplier 
implementation with an illustration of the
critical path.

            Number of     Number of    Current     Power
            Transistors   Resistors    (Amperes)   (Watts)
Logic       3952          2352         1.28        3.20
Registers   660           550          0.52        1.31
Clock       228           190          0.36        0.90
TOTAL       4840          3092         2.17        5.41

Maximum Throughput: 2.0 GHz
Latency: 1.0 nanoseconds


Table 5-2. Performance summary for the two-stage
pipelined multiplier.
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Figure 5-10. Performance bracket of the minimum period for 
the two-stage pipeline multiplier. 
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Figure 5-11. Four-stage pipelined multiplier 
implementation with an illustration of the
critical path.

            Number of     Number of    Current     Power
            Transistors   Resistors    (Amperes)   (Watts)
Logic       3952          2352         1.28        3.20
Registers   1272          1060         1.01        2.52
Clock       438           365          0.68        1.71
TOTAL       5662          3777         2.97        7.43

Maximum Throughput: 3.45 GHz
Latency: 1.16 nanoseconds


Table 5-3. Performance summary for the four-stage
pipelined multiplier.
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Figure 5-12. 
Performance bracket of the minimum period
for the four-stage pipeline multiplier.
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Figure 5-13. Six-stage pipelined multiplier 
implementation with an illustration of the
critical path.

            Number of     Number of    Current     Power
            Transistors   Resistors    (Amperes)   (Watts)
Logic       3952          2352         1.28        3.20
Registers   1872          1560         1.49        3.72
Clock       648           540          1.03        2.57
TOTAL       6472          4452         3.80        9.49

Maximum Throughput: 4.35 GHz
Latency: 1.38 nanoseconds


Table 5-4. Performance summary for the six-stage
pipelined multiplier.
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Figure 5-14. 


Performance bracket of the minimum period
for the six-stage pipeline multiplier.
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Figure 5-15. Ten-stage pipelined multiplier 
implementation with an illustration of the
critical path.

            Number of     Number of    Current     Power
            Transistors   Resistors    (Amperes)   (Watts)
Logic       3912          2320         1.28        3.20
Registers   3240          2700         2.57        6.44
Clock       1116          930          1.74        4.36
TOTAL       8268          5950         5.60        13.99

Maximum Throughput: 5.56 GHz
Latency: 1.80 nanoseconds


Table 5-5. Performance summary for the ten-stage
pipelined multiplier.
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Figure 5-16. Performance bracket of the minimum period 
for the ten-stage pipeline multiplier.

3. Comparative Analysis

A summary of the performance results for each of the 
five pipelined multiplier implementations is presented in 
Table (5-6). A comparative analysis of these results 
quantifies and confirms the major trade-offs of pipelining 
as they were addressed in Chapter II-B. Figure (5-17) 
illustrates the increase in data throughput as compared to 
the increase in product latency. However, latency is 
generally an acceptable trade-off relative to the primary 


cost drivers of device count and power consumption. 


                              1       2       4       6       10
                              STAGE   STAGE   STAGE   STAGE   STAGE
Device Count                  7239    7932    9439    10924   14218
Power (Watts)                 4.44    5.41    7.43    9.49    13.99
Latency (ns)                  0.75    1.00    1.20    1.38    1.80
Maximum Throughput (GHz)      1.33    2.00    3.33    4.35    5.56
Speed-Power Ratio (GHz/Watt)  0.300   0.370   0.449   0.458   0.397
Normalized Speed-Power Ratio  0.66    0.81    0.98    1.00    0.87


Table 5-6. Comparative Summary of Performance.
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Figure 5-17. Throughput and Latency as a function of the 
number of pipeline stages. 


Device count and power consumption are quantified in 
Figures (5-18) and (5-19), respectively. As the number of 
pipeline stages increases, the cost rises sharply - driven 
by the need for intermediate registers and an extensive 
clock distribution network. In the one-stage pipeline, the 
registers and clock tree represent only 13% of the total 
device count and consume 28% of the total power. On the 
other end of the spectrum, registers and clock distribution 
in the ten-stage pipeline represent 56% of the total device 


count and consume 77% of the total power. 
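The percentages quoted above follow directly from the rows of Tables (5-1) and (5-5). The Python sketch below repeats that arithmetic; the data values are copied from those tables, while the dictionary layout and function name are illustrative.

```python
# Rows: (transistors, resistors, power in watts), from Tables 5-1 and 5-5.
TABLES = {
    "one-stage": {
        "logic": (3952, 2352, 3.20),
        "registers": (384, 320, 0.77),
        "clock": (126, 105, 0.48),
    },
    "ten-stage": {
        "logic": (3912, 2320, 3.20),
        "registers": (3240, 2700, 6.44),
        "clock": (1116, 930, 4.36),
    },
}


def overhead_share(rows):
    """Fraction of devices and of power spent on registers plus clock."""
    devices = {name: t + r for name, (t, r, _) in rows.items()}
    power = {name: w for name, (_, _, w) in rows.items()}
    device_share = (devices["registers"] + devices["clock"]) / sum(devices.values())
    power_share = (power["registers"] + power["clock"]) / sum(power.values())
    return device_share, power_share


for name, rows in TABLES.items():
    dev, pwr = overhead_share(rows)
    print(f"{name}: {dev:.0%} of devices, {pwr:.0%} of power")
```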
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Figure 5-18. Distribution of the Device Count. 
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Figure 5-19. Distribution of Power Consumption. 
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Somewhere between these two extremes there exists an 
optimum pipelined implementation. Dividing the maximum 
throughput of each configuration by the total power that it 
consumes, a figure of merit is calculated which is referred 
to here as a speed-power ratio (for consistency with
optimization procedures in previous chapters). Figure
(5-20) plots the speed-power ratio as a function of the
number of pipeline stages. The maximum point on the curve
indicates that the optimal pipelined multiplier 


implementation employs five or six stages. 
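The figure of merit can be recomputed from the throughput and power rows of Table (5-6). The Python sketch below is only a check of that arithmetic; the list names are illustrative, and because the inputs are the rounded table values, an individual ratio may differ from the tabulated one in the last digit.

```python
STAGES = [1, 2, 4, 6, 10]
THROUGHPUT_GHZ = [1.33, 2.00, 3.33, 4.35, 5.56]  # Table 5-6
POWER_W = [4.44, 5.41, 7.43, 9.49, 13.99]        # Table 5-6

# Speed-power ratio: maximum throughput divided by total power.
ratios = [f / p for f, p in zip(THROUGHPUT_GHZ, POWER_W)]
best = max(ratios)
best_stages = STAGES[ratios.index(best)]

for n, r in zip(STAGES, ratios):
    print(f"{n:>2}-stage: {r:.3f} GHz/W, normalized {r / best:.2f}")
print(f"highest ratio among the simulated designs: {best_stages}-stage")
```

Only five designs were simulated, so the peak of the underlying curve is located by interpolation between these points.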
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Figure 5-20. Comparison of Speed-Power Ratio. 
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Thus, having concluded an evaluation of the various 
pipelined multiplier implementations, it remains to consider 
the impact that clock skew has upon these high-speed 
circuits. Chapter VI undertakes this discussion in the


pages that follow. 
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VI. ANALYSIS OF CLOCK SKEW 


A. QUANTIFYING CLOCK SKEW 

Clock skew appears naturally in practical circuits due 
to a variety of physical factors as described in Chapter 
II-A. However, in a typical SPICE simulation, transmission 
delays are not inherent to the process and circuit elements 
are evaluated under ideal, homogeneous operating conditions. 
The effective result is the near elimination of clock skew 
from the simulation environment. 

Clock skew could be introduced artificially; however,
introducing a known amount of clock skew would have very
predictable results, such that they can be determined without
simulation. Thus, based upon the results of Chapter V, a
simple numerical analysis is conducted in this chapter which
provides an illustration of how clock skew impacts pipelined
architectures and serves as a set of reference data against
which follow-on research into alternative control techniques
can measure performance.


B. ANALYSIS PROCEDURES
Based upon the definition of skew from Chapter II-A,
let $S_{DEVICE}$ represent the maximum delay between two
clock signals after propagation through a single level of
clock drivers. As illustrated in Figure (6-1), the effect of
$S_{DEVICE}$ on the clock signal as it propagates through the
clock distribution tree is that the clock signal potentially
accumulates $S_{DEVICE}$ picoseconds of skew at each level.
Furthermore, any loading differences at the final level of
the clock distribution will introduce another skew term,
$S_{LOAD}$. Thus, the simplified expression to be used for
analyzing and calculating skew is given in Equation (6-1).

(6-1)   $S_{TOTAL} = n \times S_{DEVICE} + S_{LOAD}$

where, n = maximum number of levels in the
clock distribution scheme

[Figure 6-1 diagram: the CLOCK SIGNAL propagates through
LEVEL 1 and LEVEL 2 clock drivers to final-level loads of
2 latches and 4 latches; worst-case skew =
$3 \times S_{DEVICE} + S_{LOAD}$.]

Figure 6-1. Illustration of Clock Skew as it results from
propagation path delays and loading.
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An expression for n is derived in Equation (6-2), based upon 


the pipeline implementations from Chapter V. 


(6-2)   $n = \lceil \log_{3.5}(\#REG) \rceil$

where, $\#REG = 32 + 26.4(p-1)$
       p = Number of Pipeline Levels


For synchronous logic, the timing inequality from 
Chapter II-A is repeated as Equation (6-3). This 
relationship requires that the minimum clock period be 


expanded to account for the increase in skew. 


(6-3)   $T_{CLK} \geq t_{Logic} + t_{Flip-Flop} + S_{TOTAL}$


The procedure for analysis of clock skew is simply to
apply a range of values for $S_{DEVICE}$ to the clock
distribution schemes from Chapter V, using Equation (6-2).
Based upon simulation results, the worst-case value for
$S_{LOAD}$ is determined to be 6.5 picoseconds. Thus, it is
possible to calculate a worst-case skew value for each
incremental value of $S_{DEVICE}$ as it applies to the clock
distribution scheme of each multiplier implementation.
Applying the worst-case skew values to Equation (6-3), a new
minimum period is determined for each multiplier
implementation. This is repeated for values of $S_{DEVICE}$
ranging from two to twenty picoseconds. A comparative
analysis of the results should identify/confirm the
expectation of an increasingly negative impact on the more
heavily pipelined architectures.

Finally, within the stated range of $S_{DEVICE}$ values, a
reasonable figure for $S_{DEVICE}$ is determined as it might
actually occur due to device non-idealities in the
fabrication process. The approximation of device-induced
skew ($S_{DEVICE}$) is defined as 20% of the worst-case
propagation delay for the clock driver circuit and is
determined to be 4.5 picoseconds. This set of data is
referenced in the figures that follow as “typical skew”.
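The procedure can be sketched numerically in Python. Several points in the sketch are assumptions rather than statements from the text: the number of tree levels is taken as the ceiling of a base-3.5 logarithm of the register count 32 + 26.4(p-1), the total skew is simply added to the skew-free minimum period, and the skew-free throughput values are those of Table (5-6). All function and constant names are illustrative.

```python
import math

S_LOAD_PS = 6.5  # worst-case load-induced skew quoted in the text

# Skew-free maximum throughput in GHz per pipeline depth (Table 5-6).
BASE_GHZ = {1: 1.33, 2: 2.00, 4: 3.33, 6: 4.35, 10: 5.56}


def tree_levels(p):
    """Assumed reading of Equation (6-2): ceil(log base 3.5 of #REG)."""
    registers = 32 + 26.4 * (p - 1)
    return math.ceil(math.log(registers, 3.5))


def total_skew_ps(p, s_device_ps):
    """Equation (6-1): n * S_DEVICE + S_LOAD, in picoseconds."""
    return tree_levels(p) * s_device_ps + S_LOAD_PS


def skewed_throughput_ghz(p, s_device_ps):
    """Throughput after the minimum period is lengthened by the skew."""
    period_ps = 1000.0 / BASE_GHZ[p] + total_skew_ps(p, s_device_ps)
    return 1000.0 / period_ps


for p in BASE_GHZ:
    typical = skewed_throughput_ghz(p, 4.5)   # "typical skew" case
    worst = skewed_throughput_ghz(p, 20.0)    # upper end of the range
    print(f"{p:>2}-stage: {tree_levels(p)} levels, "
          f"{typical:.2f} GHz typical, {worst:.2f} GHz at 20 ps")
```

Because the same skew is added to a shorter period in the deeper pipelines, the relative loss grows with the number of stages.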


C. RESULTS

Figure (6-2) provides a plot of the results. The
values for skew which are referenced in the figures
represent the values for $S_{DEVICE}$. The data clearly
confirms that the multipliers with throughput rates which are
obtained as a function of higher clock rates will experience
the most drastic performance reductions in the presence of
clock skew. Furthermore, when weighed against the cost of
power consumption, a set of new speed-power ratio curves is
obtained, as shown in Figure (6-3). Thus, the contemporary
appeal of synchronous pipelined architectures demonstrates a
severe backlash at high clock rates.

VII. CONCLUSIONS


The fundamentals of circuit analysis and the principles 
of junction transistor behavior have been applied to design 
an optimal family of current-mode logic devices from InP HBT 
SPICE transistor models. From these building blocks of 
digital logic, an array multiplier has been constructed and 
pipelined into five distinct implementations. Each 
multiplier implementation has been simulated extensively via 
Tanner SPICE in order to identify the respective performance 
characteristics of power consumption and maximum operating 
frequency. 

A comparative analysis of multiplier performance has 
effectively demonstrated the trade-offs of pipelining with 
predictable yet interesting results. The cost of increasing 
throughput by increasing the number of pipeline stages has 
been quantified in terms of device count and power 
consumption. By maximizing data throughput at the most
efficient cost in terms of power, the optimal 8x8 bit 
synchronous pipelined multiplier design has been determined 
to be the six-stage implementation, as shown on page 121. 

Finally, in the presence of clock skew, it has been 
demonstrated that the efficiency of synchronous pipelined 
architectures operating at high clock rates is significantly 


reduced. Thus, as device switching frequencies continue to
pave the way to faster logic circuits, the rate of data
throughput will be left behind unless the synchronous logic
design constraint of clock skew can be overcome. The impact
of clock skew has been quantified and summarized such that
it provides a reference point for further research into
alternative clocking/control techniques. 

Specifically, it is intended that future research use 
the CML HBT logic family designed in this thesis in order to 
implement the same array multiplier architecture using
asynchronous control techniques. One such endeavor is
already in progress as LtCol. Kirk Shawhan, USMC,
investigates the use of local completion signals which
employ request/acknowledge handshake signals to control the
flow of data vice the use of a global clock signal (Shawhan,
2000). Perhaps in time such asynchronous schemes will
mature into a design methodology that overcomes the obstacle 
of clock skew which now threatens to limit synchronous 


design methodology. 
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